(54) Highly scalable parallel processing computer system architecture



(57) A highly-scalable parallel processing computer system architecture is described. The parallel processing system comprises a plurality of compute nodes (200) for executing applications, a plurality of I/O nodes (212), each communicatively coupled to a plurality of storage resources (not shown), and an interconnect fabric (106) providing communication between any of the compute nodes and any of the I/O nodes. The interconnect fabric (106) comprises a network for connecting the compute nodes and the I/O nodes, the network comprising a plurality of switch nodes (810) arranged into more than g(log_b N) switch node stages (812), wherein b is a total number of switch node input/output ports (814), N is a total number of network input/output ports (816) and g(x) indicates a ceiling function providing the smallest integer not less than the argument x, the switch node stages thereby providing a plurality of paths between any network input port and network output port.

FIG. 8




[Drawing: BYNET receive-side interface with switch nodes arranged in switch node stages (812), a bounceback point (818), and compute node A (200); reference numerals 804 and 808 also appear.]





EP 0 935 200 A1 



Description 



[...] disk I/O performance has been growing at a much slower rate overall than that of the [...]

[...] phenomenon has already manifested itself in existing large-system installations. [...]

[0004] Uneven performance scaling is also occurring within the CPU. To improve CPU performance, CPU vendors [...]

[...] per second, the disk can [...] a fraction of that increase. [...]

[...] It is an object of the present invention to ameliorate the above disadvantages.

[...] The present invention provides a parallel processing system as set out in accompanying claim 1 or accompany[...]
[0012] The present invention provides a highly-scalable parallel processing computer system architecture. The [...] connecting the compute nodes and the I/O nodes, the network comprising a plurality of switch nodes arranged into more than g(log_b N) switch node stages, wherein b is a total number of switch node input/output ports, and g(x) indicates a ceiling function providing the smallest integer not less than the argument x, the switch node stages thereby providing a plurality of paths between any network input port and network output port. The switch node stages are configured to provide a plurality of bounceback points logically differentiating between switch nodes that load balance messages through the network from switch nodes that direct messages to receiving processors.
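The stage bound g(log_b N) can be worked through numerically. The following is a minimal sketch, assuming only the definitions given in the text (b and N as port counts, g as the ceiling function); the function name `min_stages` is hypothetical. It uses integer arithmetic rather than floating-point logarithms so that exact powers of b are not rounded incorrectly.

```python
def min_stages(b: int, n: int) -> int:
    """Return g(log_b N): the smallest integer s such that b**s >= n."""
    if b < 2 or n < 1:
        raise ValueError("need b >= 2 and n >= 1")
    stages, reach = 0, 1
    while reach < n:
        reach *= b      # each added stage multiplies the reachable ports by b
        stages += 1
    return stages


if __name__ == "__main__":
    # With b = 8 and N = 64 the bound is 2 stages; N = 65 needs 3.
    print(min_stages(8, 64), min_stages(8, 65))
```

Under this reading, a network built with more stages than `min_stages(b, N)` has spare routing capacity, which is what yields the plurality of paths described above.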

[0013] The present invention also provides a parallel processing computer system architecture in which each of a plurality of I/O nodes projects an image of storage objects to a plurality of compute nodes.

[0014] Preferred features of the invention are set out in the accompanying dependent claims and will be apparent to the skilled person on reading the following description of the invention which is given by way of example and in which:

FIG. 1 is a top level block diagram of an embodiment of the present invention showing the key architectural elements;
FIG. 2 is a system block diagram of an embodiment of the present invention;
FIG. 3 is a block diagram showing the structure of the IONs and the system interconnect;
FIG. 4 is a block diagram of the elements in a JBOD enclosure;
FIG. 5 is a functional block diagram of the ION physical disk driver;
FIG. 6 is a diagram showing the structure of fabric unique IDs;
FIG. 7 is a functional block diagram showing the relationships between the ION Enclosure Management modules and the ION physical disk driver;
FIG. 8 is a diagram of the BYNET host side interface;
FIG. 9 is a diagram of the PIT header;
FIG. 10 is a block diagram of the ION functional modules; and
FIG. 11 is a diagram showing the ION dipole protocol.


Overview 

[0015] FIG. 1 is an overview of the peer-to-peer architecture 100 of the present invention. This architecture comprises one or more compute resources 102 and one or more storage resources 104, communicatively coupled to the compute resources 102 via one or more interconnecting fabrics 106 and communication paths 108. The fabrics 106 provide the communication medium between all the nodes and storage, thus implementing a uniform peer access between compute resources 102 and storage resources 104.

[0016] In the architecture shown in FIG. 1, storage is no longer bound to a single set of nodes as it is in current node-centric architectures, and any node can communicate with all of the storage. This contrasts with today's multi-node systems where the physical system topology limits storage and node communication, and different topologies were often necessary to match different workloads. The architecture shown in FIG. 1 allows the communication patterns of the application software to determine the topology of the system at any given instance of time by providing a single physical architecture that supports a wide spectrum of system topologies, and embraces uneven technology growth. The isolation provided by the fabric 106 enables a fine grain scaling for each of the primary system components.

[0017] FIG. 2 presents a more detailed description of the peer-to-peer architecture of the present invention. Compute resources 102 are defined by one or more compute nodes 200, each with one or more processors 216 implementing one or more applications 204 under control of an operating system 202. Operatively coupled to the compute node 200 are peripherals 208 such as tape drives, printers, or other networks. Also operatively coupled to the compute node 200 are local storage devices 210 such as hard disks, storing compute node 200 specific information, such as the instructions comprising the operating system 202, applications 204, or other information. Application instructions may be stored and/or executed across more than one of the compute nodes 200 in a distributed processing fashion. In one embodiment, processor 216 comprises an off-the-shelf commercially available multi-purpose processor, such as the INTEL P6, and associated memory and I/O elements.

[0018] Storage resources 104 are defined by cliques 226, each of which includes a first I/O node or ION 212 and a second I/O node or ION 214, each operatively coupled by system interconnect 228 to each of the interconnect fabrics 106. The first ION 212 and second ION 214 are operatively coupled to one or more storage disks 224 (known as "just a bunch of disks" or JBOD), associated with a JBOD enclosure 222.

[0019] FIG. 2 depicts a moderate-sized system, with a typical two-to-one ION 212 to compute node ratio. The clique 226 of the present invention could also be implemented with three or more IONs 214, or with some loss in storage node availability, with a single ION 212. Clique 226 population is purely a software matter as there is no shared hardware among IONs 212. Paired IONs 212 may be referred to as "dipoles."

[0020] The present invention also comprises a management component or system administrator 230 which interfaces with the compute nodes 200, IONs 212, and the interconnect fabrics 106.
Internal Architecture

Hardware Architecture

[...] port ION 212 operation described herein, and a power module 306 for providing power to sup[...]






JBODs 

[...]ment 404 is [...] These elements [...] element 402 [...] second element [...] service [...] 706 [...]



Table I (layout lost in extraction; recoverable labels: Bits; Bytes 0 and 1; Rack Number; Element number; Chassis Position)

[...] and element number, as shown in Table I.
High Level Driver

[0028] The High Level Driver (HLD) 502 is the entry point for all requests to the ION 212 no matter what device type is being accessed. When a device is opened, the HLD 502 binds command pages to the device. These vendor-specific command pages dictate how a SCSI command descriptor block is to be built for a specific SCSI function. Command pages allow the driver to easily support devices that handle certain SCSI functions differently than the SCSI Specifications specify.

Common (Non-Device Specific) Portion 

[0029] The common portion of the HLD 502 contains the following entry points: 



cs_init - Initialize driver structures and allocate resources.
cs_open - Make a device ready for use.
cs_close - Complete I/O and remove a device from service.
cs_strategy - Block device read/write entry (Buf_t interface).
cs_intr - Service a hardware interrupt.

[0030] These routines perform the same functions for all device types. Most of these routines call device specific routines to handle any device specific requirements via a switch table indexed by device type (disk, tape, WORM, CD ROM, etc.).
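The switch-table dispatch described above can be sketched as follows. This is a minimal illustration, not the driver's actual code: the `DeviceType` enum, the table layout, and the return strings are assumptions, while the routine names follow the "cs" and "csxx" naming used in this description.

```python
from enum import IntEnum


class DeviceType(IntEnum):
    # Hypothetical numbering; the text only says the table is indexed by device type.
    DISK = 0
    TAPE = 1


def csdk_open(dev: int) -> str:
    # Device-specific open for disks.
    return f"disk {dev}: opened"


def cstp_open(dev: int) -> str:
    # Device-specific open for tapes.
    return f"tape {dev}: opened"


# Switch table: one row of device-specific routines per device type.
SWITCH_TABLE = {
    DeviceType.DISK: {"open": csdk_open},
    DeviceType.TAPE: {"open": cstp_open},
}


def cs_open(dev: int, dev_type: DeviceType) -> str:
    """Common entry point: do shared work, then dispatch through the table."""
    return SWITCH_TABLE[dev_type]["open"](dev)


if __name__ == "__main__":
    print(cs_open(3, DeviceType.DISK))
```

Keeping the per-type routines behind a table means the common entry points never branch on device type themselves; supporting a new type is a matter of adding one table row.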

[0031] The cs_open function guarantees that the device exists and is ready for I/O operations to be performed on it. Unlike current system architectures, the common portion 503 does not create a table of known devices during initialization of the operating system (OS). Instead, the driver common portion 503 is self-configuring: the driver common portion 503 determines the state of the device during the initial open of that device. This allows the driver common portion 503 to "see" devices that may have come on-line after the OS 202 initialization phase.

[0032] During the initial open, SCSI devices are bound to a command page by issuing a SCSI Inquiry command to the target device. If the device responds positively, the response data (which contains information such as vendor ID, product ID, and firmware revision level) is compared to a table of known devices within the SCSI configuration module 516. If a match is found, then the device is explicitly bound to the command page specified in that table entry. If no match is found, the device is then implicitly bound to a generic CCS (Common Command Set) or SCSI II command page based on the response data format.
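The explicit and implicit binding steps can be sketched as a table lookup with a fallback. Everything named here is hypothetical: the `InquiryData` fields mirror the items listed in the text, the table entry is invented for illustration, and the numeric response-format threshold (2 taken to mean SCSI II, anything lower taken to mean CCS) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InquiryData:
    vendor_id: str
    product_id: str
    firmware_rev: str
    response_format: int  # assumed encoding: >= 2 means SCSI II, otherwise CCS


# Hypothetical table of known devices, keyed on the Inquiry identification fields.
KNOWN_DEVICES = {("ACME", "DISK9", "1.0"): "acme_disk9_page"}


def bind_command_page(inq: Optional[InquiryData]) -> Optional[str]:
    """Return the command page name to bind, or None if the device did not answer."""
    if inq is None:
        return None                          # no positive response to Inquiry
    key = (inq.vendor_id, inq.product_id, inq.firmware_rev)
    page = KNOWN_DEVICES.get(key)
    if page is not None:
        return page                          # explicit binding from the table
    # Implicit binding to a generic page chosen by response data format.
    return "scsi2_generic_page" if inq.response_format >= 2 else "ccs_generic_page"
```

The fallback is what lets the driver open devices it has never been told about, at the cost of assuming they follow the generic command set.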

[0033] The driver common portion 503 contains routines used by the low level driver 506 and command page functions to allocate resources, to create a DMA list for scatter-gather operations, and to complete a SCSI operation.

[0034] All FCI low level driver 506 routines are called from the driver common portion 503. The driver common portion 503 is the only layer that actually initiates a SCSI operation by calling the appropriate low level driver (LLD) routine to set up the hardware and start the operation. The LLD routines are also accessed via a switch table indexed by a driver ID assigned during configuration from the SCSI configuration module 516.

Device Specific Portion

[0035] The interfaces between the common portion 502 and the device specific routines 504 are similar to the interfaces to the common portion, and include csxx_init, csxx_open, csxx_close, and csxx_strategy commands. The "xx" designation indicates the storage device type (e.g. "dk" for disk or "tp" for tape). These routines handle any device specific requirements. For example, if the device were a disk, csdk_open must read the partition table information from a specific area of the disk and csdk_strategy must use the partition table information to determine if a block is out of bounds. (Partition Tables define the logical to physical disk block mapping for each specific physical disk.)






High Level Driver Error/Failover Handling 
Error Handling 
Retries 

[0036] The HLD's 502 most common recovery method is through retrying I/Os that failed. The number of retries for a given command type is specified by the command page. For example, since a read or write command is considered very important, their associated command pages may set the retry counts to 3. An inquiry command is not as important, but constant retries during start-of-day operations may slow the system down, so its retry count may be zero.
[0037] When a request is first issued, its retry count is set to zero. Each time the request fails and the recovery scheme is to retry, the retry count is incremented. If the retry count is greater than the maximum retry count specified by the command page, the I/O has failed, and a message is transmitted back to the requester. Otherwise, it is re-issued.

[...] this rule is for unit attentions, which typically are [...]

[...] attention is received for a command, and its maximum retries is set to zero or one, the High Level Driver 502 sets the [...]
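The basic retry-count rule can be sketched as a loop. This is a minimal sketch of the general rule only; the unit attention exception is not modelled. The function name and the convention that `issue` returns True on success are assumptions made for illustration.

```python
from typing import Callable


def run_with_retries(issue: Callable[[], bool], max_retries: int) -> bool:
    """Issue a request, re-issuing on failure until the retry budget is exceeded.

    `issue` returns True on success. `max_retries` plays the role of the
    per-command limit taken from the command page.
    """
    retry_count = 0                    # set to zero when the request is first issued
    while True:
        if issue():
            return True
        retry_count += 1               # incremented on each failure that is to be retried
        if retry_count > max_retries:
            return False               # the I/O has failed; report back to the requester
```

With a limit of 3, a request that never succeeds is attempted four times in total (the original issue plus three retries); with a limit of zero it is attempted once.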

Failed Scsi_ops 

[0039] A Scsi_op that is issued to the FCI low level driver 506 may fail due to several circumstances. Table II below shows possible failure types the FCI low level driver 506 can return to the HLD 502.
| Error | Error Type | Recovery | Logged |
|---|---|---|---|
| No Sense | Check Condition | This is not considered an error. Tape devices typically return this to report Illegal Length Indicator. This should not be returned by a disk device. | YES |
| Recovered Error | Check Condition | This is not considered an error. Disk devices return this to report soft errors. | YES |
| Not Ready | Check Condition | The requested I/O did not complete. For disk devices, this typically means the disk has not spun up yet. A Delayed Retry will be attempted. | YES |
| Medium Error | Check Condition | The I/O for the block request failed due to a media error. This type of error typically happens on reads since media errors upon write are automatically reassigned which results in Recovered Errors. These errors are retried. | YES |
| Hardware Error | Check Condition | The I/O request failed due to a hardware error condition on the device. These errors are retried. | YES |
| Illegal Request | Check Condition | The I/O request failed due to a request the device does not support. Typically these errors occur when applications request mode pages that the device does not support. These errors are retried. | YES |
| Unit Attention | Check Condition | All requests that follow a device power-up or reset fail with Unit Attention. These errors are retried. | NO |
| Reservation Conflict | SCSI Status | A request was made to a device that was reserved by another initiator. These errors are not retried. | YES |
| Busy | SCSI Status | The device was too busy to fulfill the request. A Delayed Retry will be attempted. | YES |
| No Answer | SCSI/Fibre Channel | The device that an I/O request was sent to does not exist. These errors are retried. | YES |
| Reset | Low Level Driver | The request failed because it was executing on the adapter when the adapter was reset. The Low Level Driver does all error handling for this condition. | YES |
| Timeout | Low Level Driver | The request did not complete within a set period of time. The Low Level Driver does all handling for this condition. | YES |
| Parity Error | Low Level Driver | The request failed because the Low Level Driver detected a parity error during the DMA operation. These will typically be the result of PCI parity errors. This request will be retried. | YES |

Table II: Low Level Driver Error Conditions
| Structure Name | Memory Type | Description |
|---|---|---|
| HCB | Private | Hardware Control Block. Every Fibre Channel Adapter has associated with it a single HCB structure which is initialized at start of day. The HCB describes the adapter's capabilities as well as being used to manage adapter specific resources. |
| IOB | Private | IO Request Block. Used to describe a single I/O request. All I/O requests to the HIM layer use IOBs to describe them. |
| LINK_MANAGER | Private | A structure to manage the link status of all targets on the loop. |

Table IV: FC Key Data Structures



Error Handling

[0055] Errors that the FCI low level driver 506 handles tend to be errors specific to Fibre Channel and/or FCI itself.

Multiple Stage Error Handling

Failed IOBs
| Error | Error Type | Recovery | Logged |
|---|---|---|---|
| Queue Full | SCSI/FCP Status | This error should not be seen if the IONs 212 are properly configured, but if it is seen, the I/O will be placed back onto the queue to be retried. An I/O will never be failed back due to a Queue Full. | YES |
| Other | SCSI/FCP Status | Other SCSI/FCP Status errors like Busy and Check Condition are failed back to the High Level Driver 502 for error recovery. | NO (HLD does necessary logging) |
| Invalid D_ID | Fibre Channel | Access to a device that does not exist was attempted. Treated like a SCSI Selection Timeout; is sent back to High Level Driver for recovery. | NO |
| Port Logged Out | Fibre Channel | A request to a device was failed because the device thinks it was not logged into. FCI treats it like a SCSI Selection Timeout. The High Level Driver's 502 retry turns into a FC-3 Port Login prior to re-issuing the request. | YES |
| IOB Timeout | FCI | An I/O that was issued has not completed within a specified amount of time. | YES |
| Loop Failure | [...] | This is due to a premature completion of an I/O due to an AL Loop Failure. This could happen if a device is hot-plugged onto a loop when frames are being sent on the loop. The FCI LLD handles this through a multiple stage recovery: 1) Delayed Retry; 2) Reset Host Adapter; 3) Take Loop Offline. | YES |
| Controller Failure | AHIM | This occurs when the HIM detects an adapter hardware problem. The FCI LLD handles this through a multiple stage recovery: 1) Reset Host Adapter; 2) Take Loop Offline. | YES |
| Port Login Failed | FC-3 | An attempt to login to a device failed. Handled like a SCSI Selection Timeout. | NO |
| Process Login Failed | FC-3/FC-4 | An attempt to do a process login to a FCP device failed. Handled like a SCSI Selection Timeout. | NO |

Table V: HIM Error Conditions
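The multiple stage recovery listed for Loop Failure and Controller Failure can be read as an escalation ladder. The sketch below encodes only the stage order given in Table V; the rule that escalation stops as soon as a stage clears the fault, and all function and constant names, are assumptions for illustration.

```python
from typing import Callable, List

# Stage order as listed in Table V.
LOOP_FAILURE_STAGES = ["delayed_retry", "reset_host_adapter", "take_loop_offline"]
CONTROLLER_FAILURE_STAGES = ["reset_host_adapter", "take_loop_offline"]


def recover(stages: List[str], try_stage: Callable[[str], bool]) -> List[str]:
    """Apply recovery stages in order and return the stages that were attempted.

    `try_stage` is a caller-supplied hook that performs one stage and returns
    True if the fault cleared, which ends the escalation early.
    """
    attempted = []
    for stage in stages:
        attempted.append(stage)
        if try_stage(stage):
            break
    return attempted
```

For example, if a loop failure is cleared by the adapter reset, `recover(LOOP_FAILURE_STAGES, ...)` attempts the delayed retry and the reset but never takes the loop offline.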



Insufficient Resources

[0058] The FCI low level driver 506 manages resource pools for IOBs and vector tables. Since the size of these pools will be tuned to the ION 212 configuration, it should not be possible to run out of these resources, so only simple recovery procedures are implemented.

[0059] If a request for an IOB or vector table is made, and there are not enough resources to fulfill the request, the I/O is placed back onto the queue and a timer is set to restart the I/O. Insufficient resource occurrences are logged.
Start Of Day Handling

Failover Handling

[...] ION dipole 226 are attached to a common set of disk devices. At any given time, both [...]

Hardware Interface Module (HIM)

[...] The Hardware Interface Module (HIM) is designed to interface with ADAPTEC's SlimHIM 509. The HIM [...] responsibility for translating requests from the FCI low level driver 506 to a request that the SlimHIM 509 can understand and issue to the hardware. This involves taking I/O Block (IOB) requests and [...] corresponding Transfer [...] Block (TCB) requests that are understood by the SlimHIM [...]
| Structure Name | Memory Type | Description |
|---|---|---|
| TCB | Private | Task Control Block. An AIC-1160 specific structure to describe a Fibre Channel I/O. All requests to the AIC-1160 (LIP, Logins, FCP commands, etc.) are issued through a TCB. |

Table VI: Key HIM Structures

Start Of Day Handling

[...] and TCB [...] IRQ [...]






Failover Handling 

[0068] The two halves of the ION dipole 226 are attached to a common set of disk devices. At any given time, both IONs 212, 214 must be able to access all devices. From the HIM's 509 viewpoint, there is no special handling for failovers.

AIC-1160 SlimHIM 

[0069] The SlimHIM 509 module has the overall objective of providing hardware abstraction of the adapter (in the illustrated embodiment, the ADAPTEC AIC-1160). The SlimHIM 509 has the primary role of transporting fibre channel requests to the AIC-1160 adapter, servicing interrupts, and reporting status back to the HIM module through the SlimHIM 509 interface.

[0070] The SlimHIM 509 also assumes control of and initialises the AIC-1160 hardware, loads the firmware, starts run time operations, and takes control of the AIC-1160 hardware in the event of an AIC-1160 error.

External Interfaces and Protocols

[0071] All requests of the ION Physical disk driver subsystem 500 are made through the Common high level driver 
502. 

Initialization (cs_init)

[0072] A single call into the subsystem performs all initialization required to prepare a device for I/Os. During the subsystem initialization, all driver structures are allocated and initialized as well as any device or adapter hardware.

Open/Close (cs_open/cs_close)

[0073] The Open/Close interface 510 initializes and breaks down structures required to access a device. The interface 510 is unlike typical open/close routines because all "opens" and "closes" are implicitly layered. Consequently, every "open" received by the I/O physical interface driver 500 must be accompanied by a received and associated "close," and device-related structures are not freed until all "opens" have been "closed." The open/close interfaces 510 are synchronous in that the returning of the "open" or "close" indicates the completion of the request.
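Layered opens and closes of this kind are commonly implemented with a reference count. The following is a minimal sketch under that assumption; the class and attribute names are hypothetical and do not come from the source.

```python
class DeviceHandle:
    """Tracks layered opens; structures live until every open has been closed."""

    def __init__(self) -> None:
        self.open_count = 0
        self.structures_allocated = False

    def open(self) -> None:
        if self.open_count == 0:
            self.structures_allocated = True    # first open sets up device structures
        self.open_count += 1

    def close(self) -> None:
        if self.open_count == 0:
            raise RuntimeError("close without a matching open")
        self.open_count -= 1
        if self.open_count == 0:
            self.structures_allocated = False   # last close frees device structures
```

Two opens followed by one close leave the structures in place; only the second close releases them.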

Buf_t (cs_strategy)

[0074] The Buf_t interface 512 allows issuing logical block read and write requests to devices. The requester passes down a Buf_t structure that describes the I/O. Attributes like device ID, logical block address, data addresses, I/O type (read/write), and callback routines are described by the Buf_t. Upon completion of the request, the callback function specified by the requester is called. The Buf_t interface 512 is an asynchronous interface. The returning of the function back to the requester does not indicate the request has been completed. When the function returns, the I/O may or may not be executing on the device. The request may be on a queue waiting to be executed. The request is not completed until the callback function is called.
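The asynchronous contract described here (the call returns at once, and completion is signalled only by the callback) can be sketched with a simple queue. The `Buf` fields are a subset of the attributes listed in the text; the `Driver` class, its queue, and `service_one` are hypothetical stand-ins for the real driver and hardware.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque


@dataclass
class Buf:
    device_id: int
    lba: int                          # logical block address
    is_read: bool
    callback: Callable[["Buf"], None]
    done: bool = False


class Driver:
    def __init__(self) -> None:
        self.queue: Deque[Buf] = deque()

    def cs_strategy(self, buf: Buf) -> None:
        """Queue the request and return immediately; nothing has completed yet."""
        self.queue.append(buf)

    def service_one(self) -> None:
        """Stand-in for the device finishing one queued I/O."""
        buf = self.queue.popleft()
        buf.done = True
        buf.callback(buf)             # completion is reported only through the callback
```

A caller that needs the data must therefore wait for its callback rather than for `cs_strategy` to return.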

SCSILib

[0075] SCSILib 514 provides an interface to allow SCSI command descriptor blocks (CDBs) other than normal reads and writes to be sent to devices. Through this interface, requests like Start and Stop Unit will be used to spin up and spin down disks, and Send and Receive Diagnostics will be used to monitor and control enclosure devices. All SCSILib routines are synchronous. The returning of the called function indicates the completion of the request.

Interrupts (cs_intr)

[0076] The ION physical disk driver 500 is the central dispatcher for all SCSI and Fibre Channel adapter interrupts. In one embodiment, a Front-End/Back-End interrupt scheme is utilized. In such cases, when an interrupt is serviced, a Front-End Interrupt Service Routine is called. The Front-End executes from the interrupt stack and is responsible for clearing the source of the interrupt, disabling the adapter from generating further interrupts and scheduling a Back-End Interrupt Service Routine. The Back-End executes as a high-priority task that actually handles the interrupt (along with any other interrupts that might have occurred between the disabling of adapter interrupts and the start of the Back-End task). Before exiting the Back-End, interrupts are re-enabled on the adapter.
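The Front-End/Back-End split can be simulated in a few lines. This is a sketch only: the `Adapter` and `InterruptDispatcher` classes are hypothetical, events are plain strings, and "scheduling" the Back-End is reduced to setting a flag that the caller acts on.

```python
from typing import List


class Adapter:
    def __init__(self) -> None:
        self.interrupts_enabled = True
        self.pending: List[str] = []      # events awaiting service


class InterruptDispatcher:
    def __init__(self, adapter: Adapter) -> None:
        self.adapter = adapter
        self.back_end_scheduled = False
        self.handled: List[str] = []

    def raise_interrupt(self, event: str) -> None:
        """Hardware posts an event; the Front-End runs only if interrupts are enabled."""
        self.adapter.pending.append(event)
        if self.adapter.interrupts_enabled:
            self.front_end()

    def front_end(self) -> None:
        # Minimal work: mask further adapter interrupts and schedule the Back-End.
        self.adapter.interrupts_enabled = False
        self.back_end_scheduled = True

    def back_end(self) -> None:
        # Handle everything that accumulated, then re-enable adapter interrupts.
        while self.adapter.pending:
            self.handled.append(self.adapter.pending.pop(0))
        self.back_end_scheduled = False
        self.adapter.interrupts_enabled = True
```

Events that arrive while the adapter is masked are still picked up by the same Back-End pass, which is the batching behaviour the paragraph describes.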

ION Functions 

[0077] IONs 212 perform five primary functions. These functions include:

Storage naming and projection: Coordinates with the compute nodes 200 to provide a uniform and consistent naming of storage, by projecting images of storage resource objects stored on the storage disks 224 to the compute nodes 200;

Disk management: Implements data distribution and data redundancy techniques with the storage disk drives 224 operatively coupled to the ION 212;

Storage management: For handling storage set up, data movement, including processing of I/O requests from the compute nodes 200; performance instrumentation, and event distribution;

Cache management: For read and write data caching, including cache fill operations such as application hint pre-fetch;

Interconnect management: To control the flow of data to and from the compute nodes 200 to optimize performance; also controls the routing of requests and therefore controls the distribution of storage between the two IONs 212 in a dipole 226.

Storage Naming and Projection

[0078] IONs 212 project images of storage resource objects stored on the storage disks 224 to the compute nodes 200. An important part of this function is the creation and allocation of globally unique names, fabric unique volume set IDs (VSIs) 602 for each storage resource (including virtual fabric disks) managed by the ION 212.

[0079] FIG. 6 is a diagram showing the structure and content of the VSI 602 and associated data. Since it is important that the VSIs 602 be unique and non-conflicting, each ION 212 is responsible for creating and allocating globally unique names for the storage resources managed locally by that ION 212, and only that ION 212 managing the storage resource storing the storage resource object is permitted to allocate a VSI 602 for that storage resource. Although only the ION 212 currently managing the resident storage resource can create and allocate a VSI 602, other IONs 212 may thereafter manage storage and retrieval of those storage resources. That is because the VSI 602 for a particular data object does not have to change if an ION-assigned VSI 602 is later moved to a storage resource managed by another ION 212.

[0080] The VSI 602 is implemented as a 64-bit number that contains two parts: an ION identifier 604, and a sequence number 606. The ION identifier 604 is a globally unique identification number that is assigned to each ION 212. One technique of obtaining a globally unique ION identifier 604 is to use the electronically readable motherboard serial number that is often stored in the real time clock chip. This serial number is unique, since it is assigned to only one motherboard. Since the ION identifier 604 is a globally unique number, each ION 212 can allocate a sequence number 606 that is only locally unique, and still create a globally unique VSI 602.
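The two-part construction can be illustrated with simple bit packing. The text fixes only the total width (64 bits); the even 32/32 split between ION identifier and sequence number used below, and the function names, are assumptions for illustration.

```python
from typing import Tuple

ION_ID_BITS = 32   # assumed width of the ION identifier field
SEQ_BITS = 32      # assumed width of the sequence number field


def make_vsi(ion_id: int, seq: int) -> int:
    """Pack a globally unique ION identifier and a locally unique sequence number."""
    if not 0 <= ion_id < (1 << ION_ID_BITS):
        raise ValueError("ion_id out of range")
    if not 0 <= seq < (1 << SEQ_BITS):
        raise ValueError("seq out of range")
    return (ion_id << SEQ_BITS) | seq


def split_vsi(vsi: int) -> Tuple[int, int]:
    """Recover (ion_id, seq) from a packed VSI."""
    return vsi >> SEQ_BITS, vsi & ((1 << SEQ_BITS) - 1)
```

Because the identifier occupies its own bit field, two IONs that happen to allocate the same sequence number still produce different VSIs, with no coordination between them.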

[0081] After the VSI 602 is bound to a storage resource on the ION 212, the ION 212 exports the VSI 602 through a 
broadcast message to all nodes on the fabric 106 to enable access to the storage resource 104. This process is further 
discussed in the ION name export section herein. 

[0082] Using the exported VSI 602, the compute node 200 software then creates a local entry point for that storage resource that is semantically transparent in that it is indistinguishable from any other locally attached storage device. For example, if the compute node operating system 202 were UNIX, both block device and raw device entry points are created in the device directory similar to a locally attached device such as peripherals 108 or disks 210. For other operating systems 202, similar semantic equivalencies are followed. Among compute nodes 200 running different operating systems 202, root name consistency is maintained to best support the heterogeneous computing environment. Local entry points in the compute nodes 200 are dynamically updated by the ION 212 to track the current availability of the exported storage resources 104. The VSI 602 is used by an OS dependent algorithm running on the compute node 200 to create device entry point names for imported storage resources. This approach guarantees name consistency among the nodes that share a common operating system. This allows the system to maintain root name consistency to support a heterogeneous computing environment by dynamically (instead of statically) creating local entry points for globally named storage resources on each compute node 200.

[0083] As discussed above, the details of creating the VSI 602 for the storage resource 104 are directly controlled by the ION 212 that is exporting the storage resource 104. To account for potential operating system 202 differences among the compute nodes 200, one or more descriptive headers are associated with each VSI 602 and are stored with the VSI 602 on the ION 212. Each VSI 602 descriptor 608 includes an operating system (OS) dependent data section 610 for storing sufficient OS 202 dependent data necessary for the consistent (both the name and the operational semantics are the same across the compute nodes 200) creation of device entry points on the compute nodes 200 for that particular VSI 602. This OS dependent data 610 includes, for example, data describing local access rights 612, and ownership information 614. After a VSI 602 is established by the ION 212, imported by the compute node 200, but before the entry point for that storage resource 104 associated with the VSI 602 can be created, the appropriate OS specific data 610 is sent to the compute node 200 by the ION 212. The multiple descriptive headers per VSI 602 enable both concurrent support of multiple compute nodes 200 running different OSs (each OS has its own descriptor header) and support of disjoint access rights among different groups of compute nodes 200. Compute nodes 200 that share the same descriptor header share a common and consistent creation of device entry points. Thus, both the name and the operational semantics can be kept consistent on all compute nodes 200 that share a common set of access rights.

10 [0084] Hie VSI descriptor 608 also comprises an alias field 61 6, which can be used to present a human-readable VSI 
602 name on the compute nodes 200. For example, if the alias for VS1 1 984 is "soma," then the compute node 200 will 
have the directory entries for both 1 984 and "soma/ Since the VSI descriptor 608 is stored with the VSI 602 on the ION 
212, the same alias and local access rights will appear on each compute node 200 that imports the VSI 602. 
[0085] As described above, the present invention uses a naming approach suitable for a distributed allocation scheme. In this approach, names are generated locally following an algorithm that guarantees global uniqueness. While variations of this could follow a locally centralized approach, where a central name server exists for each system, availability and robustness requirements weigh heavily towards a pure distributed approach. Using the foregoing, the present invention is able to create a locally executed algorithm that guarantees global uniqueness.
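
The idea of locally generated, globally unique names can be illustrated with a minimal sketch. It assumes the construction mentioned elsewhere in this description, in which a VSI is built from a unique ID associated with the ION hardware and an ION managed sequence number. The class name `VsiNamer`, the 32/32 bit split, and the integer encoding are illustrative assumptions and are not taken from the specification.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class VsiNamer:
    """Hypothetical per-ION name generator; no central name server is consulted."""
    hardware_id: int  # assumed to be unique per ION and to fit in 32 bits
    _seq: itertools.count = field(default_factory=itertools.count, repr=False)

    def next_vsi(self) -> int:
        # Assumed layout: high 32 bits = hardware ID, low 32 bits = local sequence number.
        seq = next(self._seq)
        if seq >= 2**32:
            raise OverflowError("local sequence space exhausted")
        return (self.hardware_id << 32) | seq


# Two IONs generate names independently, without talking to each other.
ion_a, ion_b = VsiNamer(hardware_id=0x0A), VsiNamer(hardware_id=0x0B)
names = [ion_a.next_vsi(), ion_a.next_vsi(), ion_b.next_vsi()]
```

Uniqueness in this sketch rests entirely on the hardware IDs being distinct, which is also why physically moving ION hardware can produce the conflicts that the export protocol has to detect.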
[0086] The creation of a globally consistent storage system requires more support than simply preserving name consistency across the compute nodes 200. Hand in hand with names are the issues of security, which take two forms in the present invention. First is the security of the interface between the IONs 212 and the compute nodes 200; second is the security of storage from within the compute node 200.

Storage Authentication and Authorization 


[0087] A VSI 602 resource is protected with two distinct mechanisms: authentication and authorization. If a compute node 200 is authenticated by the ION 212, then the VSI name is exported to the compute node 200. An exported VSI 602 appears as a device name on the compute node 200. Application threads running on a compute node 200 can attempt to perform operations on this device name. The access rights of the device entry point and the OS semantics of the compute nodes 200 determine whether an application thread is authorized to perform any given operation.

[0088] This approach to authorization extends compute node 200 authorization to storage resources 104 located anywhere accessible by the interconnect fabric 106. However, the present invention differs from other computer architectures in that storage resources 104 in the present invention are not directly managed by the compute nodes 200. This difference makes it impractical to simply bind local authorization data to file system entities. Instead, the present invention binds compute node 200 authorization policy data with the VSI 602 at the ION 212, and uses a two stage approach in which the compute node 200 and the ION 212 share a level of mutual trust. An ION 212 authorizes each compute node 200 access to a specific VSI 602, but further refinement of the authorization of a specific application thread to the data designated by the VSI is the responsibility of the compute node 200. Compute nodes 200 then enforce the authorization policy for storage entities 104 by using the policies contained in the authorization metadata stored by the ION 212. Hence, the compute nodes 200 are required to trust the ION 212 to preserve the metadata, and the ION 212 is required to trust the compute node 200 to enforce the authorization. One advantage of this approach is that it does not require the ION 212 to have knowledge regarding how to interpret the metadata. Therefore, the ION 212 is isolated from enforcing the specific authorization semantics imposed by the different operating systems 202 used by the compute nodes 200.
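
The two stage approach can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation: the `Ion` class, the dictionary layout of the metadata, and the function `unix_may_write` are hypothetical names. The only points carried over from the text are that the ION decides which nodes may import a VSI and stores the per-OS metadata without interpreting it, while a UNIX compute node applies its own user, group, and mode bit semantics.

```python
import stat
from dataclasses import dataclass, field


@dataclass
class Ion:
    """Stage 1: the ION authorizes nodes per VSI; the metadata stays opaque to it."""
    allowed: dict = field(default_factory=dict)    # vsi -> set of node names
    metadata: dict = field(default_factory=dict)   # (vsi, os_name) -> opaque blob

    def export(self, node: str, os_name: str, vsi: int):
        if node not in self.allowed.get(vsi, set()):
            return None                       # node may not import this VSI
        return self.metadata[(vsi, os_name)]  # handed over uninterpreted


def unix_may_write(meta: dict, uid: int, gids: set) -> bool:
    """Stage 2: a UNIX compute node interprets the metadata with its own semantics."""
    mode = meta["mode"]
    if uid == meta["uid"]:
        return bool(mode & stat.S_IWUSR)
    if meta["gid"] in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


ion = Ion(allowed={1984: {"cn1"}},
          metadata={(1984, "unix"): {"uid": 100, "gid": 20, "mode": 0o640}})
meta = ion.export("cn1", "unix", 1984)
```

Because `Ion.export` never looks inside the blob, a node running a different operating system could store a completely different structure under its own `os_name` key without any change to the ION side.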
[0089] All data associated with a VSI 602 (including access rights) are stored on the ION 212, but the burden of managing the contents of the access rights data is placed on the compute nodes 200. More specifically, when the list of VSIs 602 being exported by an ION 212 is sent to a compute node 200, associated with each VSI 602 is all of the OS specific data required by the compute node 200 to enforce local authorization. For example, a compute node 200 running UNIX would be sent the name, the group name, the user ID, and the mode bits: sufficient data to make a device entry node in a file system. Alternative names for a VSI 602 specific to that class of compute node operating systems 202 (or specific to just that compute node 200) are included with each VSI 602. Local OS specific commands that alter access rights of a storage device are captured by the compute node 200 software and converted into a message sent to the ION 212. This message updates VSI access right data specific to the OS version. When this change has been completed, the ION 212 transmits the update to all compute nodes 200 using that OS in the system.
[0090] When a compute node (CN) 200 comes on line, it transmits an "I'm here" message to each ION 212. This message includes a digital signature that identifies the compute node 200. If the compute node 200 is known by the ION 212 (the ION 212 authenticates the compute node 200), the ION 212 exports every VSI name that the compute node 200 has access rights to. The compute node 200 uses these lists of VSI 602 names to build the local access entry
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degenerative conditions) can then be resolved via the generation number. As long as no compute nodes 200 are using 
the VSI 602, a newcomer with a higher generation number can be allowed to invalidate the current exporter of a specific 
VSI 602. 

Name Service
ION Name Export 

[0098] An ION 212 exports the Working Set of VSIs 602 that it exclusively owns to enable access to the associated storage. The Working Set of VSIs exported by an ION 212 is dynamically determined through VSI ownership negotiation with the Buddy ION (the other ION 212 in the dipole 226, denoted as 214) and should be globally unique within all nodes communicating with the interconnect fabric 106. The set is typically the default or PRIMARY set of VSIs 602 assigned to the ION 212. VSI Migration for Dynamic Load Balancing, and exception conditions that include buddy ION 214 failure and I/O path failure, may cause the exported VSI 602 set to differ from the PRIMARY set.

[0099] The Working Set of VSIs is exported by the ION 212 via a broadcast message whenever the Working Set changes, to provide compute nodes 200 with the latest VSI 602 configuration. A compute node 200 may also interrogate an ION 212 for its working set of VSIs 602. I/O access to the VSIs 602 can be initiated by the compute nodes 200 once the ION 212 enters or reenters the online state for the exported VSIs 602. As previously described, an ION 212 may not be permitted to enter the online state if there are any conflicts in the exported VSIs 602. The VSIs 602 associated with a chunk of storage should all be unique, but there is a chance that conflicts may arise (for example, if the VSI were constructed from a unique ID associated with the ION 212 hardware and an ION 212 managed sequence number, and the ION 212 hardware were physically moved) where multiple chunks of storage may have the same VSI.
[0100] Once the Working Set has been exported, the exporting ION 212 sets a Conflict Check Timer (2 seconds) before entering the online state to enable I/O access to the exported VSIs 602. The Conflict Check Timer attempts to give sufficient time for the importers to do the conflict check processing and to notify the exporter of conflicts, but this cannot be guaranteed unless the timer is set to a very large value. Therefore, an ION 212 needs explicit approval from all nodes (compute nodes 200 and IONs 212) to officially go online. The online broadcast message is synchronously responded to by all nodes, and the result is merged and broadcast back out. An ION 212 officially enters the online state if the merged response is an ACK. If the ION 212 is not allowed to go online, the newly exported set of VSIs 602 cannot be accessed. The node(s) that sent the NAK also subsequently send a VSI conflict message to the exporter to resolve the conflict. Once the conflict is resolved, the ION 212 exports its adjusted Working Set and attempts to go online once again.
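
The approval step can be pictured with a short sketch. It assumes that the merged response is an ACK only when every node answered ACK, and that a node answers NAK when an exported VSI collides with one it already knows from another exporter. The names `Reply`, `merge_replies`, and `try_go_online` are hypothetical, and the timer and broadcast transport are left out.

```python
from enum import Enum


class Reply(Enum):
    ACK = "ACK"
    NAK = "NAK"


def merge_replies(replies: dict) -> Reply:
    """Merged result is ACK only if every responding node sent ACK."""
    return Reply.ACK if all(r is Reply.ACK for r in replies.values()) else Reply.NAK


def try_go_online(exported: set, known_by_node: dict):
    """Collect one reply per node and the set of VSIs reported as conflicting."""
    replies, conflicts = {}, set()
    for node, known in known_by_node.items():
        clash = exported & known          # VSIs this node already sees elsewhere
        replies[node] = Reply.NAK if clash else Reply.ACK
        conflicts |= clash
    return merge_replies(replies), conflicts


# VSI 3 is already known to "ion2", so the exporter is not allowed online.
result, conflicts = try_go_online({1, 2, 3}, {"cn1": {7}, "ion2": {3, 9}})
```

After the exporter drops or renames the conflicting VSI, calling `try_go_online` again with the adjusted set corresponds to the retry described above.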

CN Name Import 


[0101] The compute nodes 200 are responsible for taking actions to import all VSIs 504 exported by all IONs 212. During Start of Day Processing, a compute node 200 requests from all online IONs 212 the VSIs 602 that were previously exported, so that it can get an up to date view of the name space. From that point on, a compute node 200 listens for VSI 602 exports.

[0102] Control information associated with a VSI 602 is contained in a vsnode that is maintained by the ION 212. The compute node 200 portion of the vsnode contains information used for the construction and management of the Names presented to applications 204. The vsnode information includes user access rights and Name Aliases.

Name Domain and Aliases 


[0103] VSIs 602 may be configured to have an application defined Name Alias that provides an alternate name to access the associated storage. The Name Aliases can be attached to a Virtual Storage Domain to logically group a set of Names. Name Aliases must be unique within a Virtual Storage Domain.

VSNODE

[0104] Modifications to the vsnode by a compute node 200 are sent to the owning ION 212 for immediate update and processing. The vsnode changes are then propagated by the ION 212 to all nodes by exporting the changes and reentering the online state.


Storage Disk Management 

[0105] The JBOD enclosure 222 is responsible for providing the physical environment for the disk devices as well as 
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[0106] In the past, management applications typically interfaced with enclosures through an out-of-band connection 

[0107] The in-band connection uses a set of SCSI commands originating from the host that are sent to a SCSI device for querying and controlling the configuration status, and a mechanism for a device to communicate with the enclosure itself. The portion of the protocol between the host and the disk drives is detailed in the SCSI-3 Enclosure Services (SES) specification, which is incorporated by reference herein.

[0108] The SES interface uses the SEND DIAGNOSTIC and RECEIVE DIAGNOSTIC RESULTS commands to control and to receive status information from enclosure elements, respectively.

[0109] When using the SEND DIAGNOSTIC or RECEIVE DIAGNOSTIC RESULTS commands, a page code is specified. The pages that can be requested via the SEND DIAGNOSTIC and RECEIVE DIAGNOSTIC RESULTS commands are detailed in Table VII below. Bolded items are required by the SES Event Monitor.
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Page Code 


Page Code   SEND DIAGNOSTIC                          RECEIVE DIAGNOSTIC RESULTS
---------   --------------------------------------   --------------------------------------
0h          N/A                                      Supported Diagnostics
1h          N/A                                      Configuration
2h          Enclosure Control                        Enclosure Status
3h          N/A                                      ES Help Text
4h          ES String Out                            ES String In
5h          ES Threshold Out                         ES Threshold In
6h          ES Array Control                         ES Array Status
7h          N/A                                      Element Descriptor
8h-3Fh      Reserved (applies to all device types)   Reserved (applies to all device types)
40h-7Fh     Specific device type                     Specific device type
80h-FFh     Vendor specific pages                    Vendor specific pages
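
As a small illustration of how the page code ranges in the table above partition the one byte code space, the sketch below maps a code to the name of the page returned by RECEIVE DIAGNOSTIC RESULTS or to the range it falls in. The function name `classify_page` and the returned strings are assumptions made for this example; only the ranges and page names come from the table.

```python
# Status pages returned by RECEIVE DIAGNOSTIC RESULTS for codes 0h-7h.
STATUS_PAGES = {
    0x00: "Supported Diagnostics",
    0x01: "Configuration",
    0x02: "Enclosure Status",
    0x03: "ES Help Text",
    0x04: "ES String In",
    0x05: "ES Threshold In",
    0x06: "ES Array Status",
    0x07: "Element Descriptor",
}


def classify_page(code: int) -> str:
    """Return the status page name, or the range a page code belongs to."""
    if not 0 <= code <= 0xFF:
        raise ValueError("page code must fit in one byte")
    if code in STATUS_PAGES:
        return STATUS_PAGES[code]
    if code <= 0x3F:
        return "reserved"
    if code <= 0x7F:
        return "device-type specific"
    return "vendor specific"
```

For example, `classify_page(0x02)` names the enclosure status page that a polling client would request.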



[0110] The application client may periodically poll the enclosure by executing a READ DIAGNOSTICS RESULTS command. If the returned data is incomplete, the application client can reissue the command with a greater allocation length to obtain the complete results.

ION Enclosure Management 

[0111] FIG. 7 shows the relationships between the ION's Enclosure Management modules and the ION physical disk driver architecture 500. Two components make up this subsystem: the SES Event Monitor 702 and the SCC2+ to SES Gasket 704. The SES Event Monitor 702 is responsible for monitoring all attached enclosure service processes and, in the event of a status change, reporting it; the report can be forwarded to a management service layer 706 if necessary. The SCC2+ to SES Gasket component 704 is responsible for taking SCC2+ commands coming from configuration and maintenance applications and translating them into one or more SES commands to the enclosure service process. This removes the need for the application client to know the specifics of the JBOD configuration.

SES Event Monitor

[0112] The SES Event Monitor 702 reports enclosure 222 service process status changes back to the Management Service Layer 706. Status information gets reported via an Event Logging Subsystem. The SES Event Monitor 702 periodically polls each enclosure process by executing a READ DIAGNOSTICS RESULTS command requesting the enclosure status page. The READ DIAGNOSTICS RESULTS command will be sent via the SCSILib interface 514 as provided by the ION physical device disk driver 500. Statuses that may be reported include status items listed in Table VIII below.



Element        Status              Description
------------   -----------------   ----------------------------------------------------------------------
All            OK                  Element is installed and no error conditions are known.
               Not Installed       Element is not installed in enclosure.
               Critical            Critical Condition is detected.
Disk           Fault Sensed        The enclosure or disk has detected a fault condition.
Power Supply   DC Overvoltage      An overvoltage condition has been detected at the power supply output.
               DC Undervoltage     An undervoltage condition has been detected at the power supply output.
               Power Supply Fail   A failure condition has been detected.
               Temp Warn           An over temperature has been detected.
               Off                 The power supply is not providing power.
Cooling        Fan Fail            A failure condition has been detected.
               Off                 Fan is not providing cooling.

Table VIII: Enclosure Status Values



[0113] When the SES Event Monitor 702 starts, it reads in the status for each element 402-424 contained in the enclosure. This status is the Current Status. When a status change is detected, each status that changed from the Current Status is reported back to the Management Service Layer 706. This new status is now the Current Status. For example, if the current status for a fan element is OK and a status change now reports the element as Fan Fail, an event will be reported that specifies a fan failure. If another status change now specifies that the element is Not Installed, another event will be reported that specifies the fan has been removed from the enclosure. If another status change specifies that the fan element is OK, another event will be generated that specifies that a fan has been hot-plugged and is working properly.
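
The change detection described above amounts to comparing each polled status with the stored Current Status and reporting only the differences. The following sketch replays the fan example; the class `SesEventMonitor`, the helper `diff_status`, and the callback standing in for the Management Service Layer are hypothetical names chosen for the illustration.

```python
def diff_status(current: dict, polled: dict) -> list:
    """Return (element, old, new) for every element whose status changed."""
    return [(el, current.get(el), new)
            for el, new in polled.items() if current.get(el) != new]


class SesEventMonitor:
    """Hypothetical monitor: keeps the Current Status and reports only changes."""

    def __init__(self, initial: dict, report):
        self.current = dict(initial)
        self.report = report          # stand-in for the Management Service Layer

    def poll(self, polled: dict) -> None:
        for el, old, new in diff_status(self.current, polled):
            self.report(el, old, new)
        self.current.update(polled)   # the new status becomes the Current Status


events = []
mon = SesEventMonitor({"fan0": "OK", "psu0": "OK"}, lambda *e: events.append(e))
for status in ("Fan Fail", "Not Installed", "OK"):
    mon.poll({"fan0": status, "psu0": "OK"})
```

The power supply never changes state in this run, so it produces no events, while the fan produces one event per transition.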

Start Of Day Handling 


[0114] The SES Event Monitor 702 is started after the successful initialization of the ION physical disk driver 500. After starting, the SES Event Monitor 702 reads the JBOD and SCSI Configuration Module 516 to find the correlation of disk devices and enclosure service devices, and how the devices are addressed. Next, the status of each enclosure status device is read. Then, events are generated for all error conditions and missing elements. After these steps are completed, the status is now the Current Status, and polling begins.
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SCC2+ to SES Gasket 



[0115] SCC2+ is the protocol used by the ION 212 to configure and manage Virtual and Physical devices. The plus "+" in SCC2+ represents the additions to the SCC2 which allow full manageability of the ION's 212 devices and components, and to allow consistent mapping of SCC2 defined commands to SES.

[0116] The Service Layer 706 addresses JBOD enclosure 222 elements through SCC2 MAINTENANCE IN and MAINTENANCE OUT commands. The following sections describe the service actions which provide the means for configuring, controlling, and reporting the status of the components. Each of these commands will be implemented on the ION 212 as a series of SEND DIAGNOSTIC and RECEIVE DIAGNOSTIC RESULTS SCSI commands.

[0117] Configuration of components will be performed using the following service actions.

ADD COMPONENT DEVICE - The ADD COMPONENT DEVICE command is used to configure component devices into the system, and to define their LUN addresses. The LUN address will be assigned by the ION 212 based on the component's position in the SES Configuration Page. The REPORT COMPONENT DEVICE service action is performed following this command to obtain the results of the LUN assignments.

REPORT COMPONENT DEVICE - The REPORT COMPONENT DEVICE STATUS service action is a vendor unique command intended to retrieve complete status information about a component device. SES provides four bytes of status for each element type. This new command is required because the REPORT STATES and REPORT COMPONENT DEVICE service actions allocate only one byte for status information, and the defined status codes conflict with those defined by the SES standard.

ATTACH COMPONENT DEVICE - The ATTACH COMPONENT DEVICE requests that one or more logical units be logically attached to the specified component device. This command may be used to form logical associations between volume sets and the component devices upon which they are dependent, such as fans, power supplies, etc.

... component device be replaced with another.

REMOVE COMPONENT DEVICE - The REMOVE PERIPHERAL DEVICE/COMPONENT DEVICE service action requests that a peripheral or component device be removed from the system configuration. If a component device that has attached logical units is being removed, the command will be terminated with a CHECK CONDITION. The sense key will be ILLEGAL REQUEST with an additional sense qualifier of REMOVE OF LOGICAL UNIT FAILED.

[0118] Status and other information about a component may be obtained through the following service actions:

REPORT COMPONENT STATUS - The REPORT COMPONENT DEVICE STATUS service action is a vendor unique command intended to retrieve complete status information about a component device. SES provides four bytes of status for each element type. The REPORT STATES and REPORT COMPONENT DEVICE service actions allocate only one byte for status information, and the defined status codes conflict with those defined by the SES standard. Therefore this new command is required.

REPORT STATES - The REPORT STATES service action requests state information about the selected logical units. A list of one or more states for each logical unit will be returned.

REPORT COMPONENT DEVICE - The REPORT COMPONENT DEVICE service action requests information regarding component device(s) within the JBOD. An ordered list of LUN descriptors is returned, reporting the LUN address, component type, and overall status. This command will be used as part of the initial configuration process to determine the LUN address assigned by the ADD COMPONENT DEVICE service action.

REPORT COMPONENT DEVICE ATTACHMENTS - The REPORT COMPONENT DEVICE ATTACHMENTS service action requests information regarding logical units which are attached to the specified component device(s). A list of component device descriptors is returned, each containing a list of LUN descriptors. The LUN descriptors specify the type and LUN address for each logical unit attached to the corresponding component.

REPORT COMPONENT DEVICE IDENTIFIER - The REPORT COMPONENT DEVICE IDENTIFIER service action requests the location of the specified component device. An ASCII value indicating the position of the component will be returned. This value must have been previously set by the SET COMPONENT DEVICE IDENTIFIER service action.

[0119] Management of components will be performed through the following:

INSTRUCT COMPONENT DEVICE - The INSTRUCT COMPONENT DEVICE command is used to send control instructions, such as power on or off, to a component device. The actions that may be applied to a particular device
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vary according to component type, and are vendor specific. 

BREAK COMPONENT DEVICE - The BREAK COMPONENT DEVICE service action places the specified component(s) into the broken (failed) state.

Interconnect Fabric

Overview 

[0120] Since it allows more data movement, the fabric attached storage model of the present invention must address I/O performance concerns due to data copies and interrupt processing costs. Data copy, interrupt and flow control issues are addressed in the present invention by a unique combination of methods. Unlike the destination-based addressing model used by most networks, the present invention uses a sender-based addressing model where the sender selects the target buffer on the destination before the data is transmitted over the fabric. In a sender-based model, the destination transmits to the sender a list of destination addresses where messages can be sent before the messages are sent. To send a message, the sender first selects a destination buffer from this list. This is possible because the target side application has already given the addresses for these buffers to the OS for use by the target network hardware, and the network hardware is therefore given enough information to transfer the data via a DMA operation directly into the correct target buffer without a copy.
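
A toy model of sender-based addressing is sketched below. It is only an illustration under simplifying assumptions: buffer "addresses" are small integers, the DMA transfer is modeled as a slice assignment into a preallocated `bytearray`, and the class names `Destination` and `Sender` are invented for the example.

```python
class Destination:
    """Owns the memory and advertises buffer addresses to a sender ahead of time."""

    def __init__(self, n_buffers: int, size: int):
        self.memory = {addr: bytearray(size) for addr in range(n_buffers)}

    def advertise(self) -> list:
        return list(self.memory)          # addresses the sender may target

    def dma_write(self, addr: int, data: bytes) -> None:
        buf = self.memory[addr]
        if len(data) > len(buf):
            raise ValueError("payload larger than target buffer")
        buf[:len(data)] = data            # placed directly, no intermediate copy


class Sender:
    def __init__(self, free_addrs: list):
        self.free = list(free_addrs)

    def send(self, dest: Destination, data: bytes) -> int:
        addr = self.free.pop(0)           # the sender picks the target buffer
        dest.dma_write(addr, data)
        return addr


dest = Destination(n_buffers=2, size=8)
sender = Sender(dest.advertise())
used = sender.send(dest, b"hello")
```

The sketch also makes the weakness discussed next visible: the sender keeps holding address 1 even if the destination were to restart and reuse that memory for something else.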

[0121] While beneficial in some respects, there are several issues with sender-based addressing. First, sender-based addressing extends the protection domain across the fabric from the destination to include the sender, creating a general lack of isolation and raising data security and integrity concerns. Pure sender-based addressing releases memory addresses to the sender and requires the destination to trust the sender, a major issue in a high-availability system. For example, consider the case when the destination node has given a list of destination addresses to the sender. Before the sender uses all these addresses, the destination node crashes and then reboots. The send-side now has a set of address buffers that are no longer valid. The destination may be using those addresses for a different purpose. A message sent to any one of them might have serious consequences as critical data could be destroyed on the destination.
[0122] Second, the implementation of sender-based addressing requires cooperation of the network to extract the 
destination address from the message before it can initiate the DMA of the data, and most network interfaces are not 
designed to operate this way. 

[0123] What is needed is an addressing model that embraces the advantages of a sender-based model, but avoids the problems. The present invention solves this problem with a hybrid addressing model using a unique "put it there" (PIT) protocol that uses an interconnect fabric based on the BYNET.

BYNET and the BYNET interface 


[0124] BYNET has three important attributes which are useful to implement the present invention.

[0125] First, BYNET is inherently scalable - additional connectivity or bandwidth can easily be introduced and is immediately available to all entities in the system. This is in contrast with other, bus-oriented interconnect technologies, which do not add bandwidth as a result of adding connections. When compared to other interconnects, BYNET not only scales in terms of fan-out (the number of ports available in a single fabric) but also has a bisection bandwidth that scales with fan-out.

[0126] Second, BYNET can be enhanced by software to be an active message interconnect - under its users' (i.e. compute resources 102 and storage resources 104) directions, it can move data between nodes with minimal disruption to their operations. It uses DMA to move data directly to pre-determined memory addresses, avoiding unnecessary interrupts and internal data copying. This basic technique can be expanded to optimize the movement of smaller data blocks by multiplexing them into one larger interconnect message. Each individual data block can be processed using a modification of the DMA-based technique, retaining the node operational efficiency advantages while optimizing interconnect use.

[0127] Third, because the BYNET can be configured to provide multiple fabrics, it is possible to provide further interconnect optimization using Traffic Shaping. This is essentially a mechanism provided by the BYNET software to assign certain interconnect channels (fabrics) to certain kinds of traffic, reducing, for example, the interference that random combinations of long and short messages can generate in heavily-used shared channels. Traffic shaping is enabled by BYNET, but it will initially be used judiciously as we find out the advantages and drawbacks to specific shaping algorithms. Responses from experiments and experience will be applied to enhance these algorithms, which may even be user-selectable for predictable traffic patterns.
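
One very simple shaping policy, offered only as an assumed example since the text does not commit to a specific algorithm, is to keep short and long messages on different fabrics so that they do not interfere. The function name `pick_fabric`, the two-class split, and the 4096 byte threshold are all hypothetical.

```python
def pick_fabric(msg_len: int, fabrics: int = 2, short_limit: int = 4096) -> int:
    """Assumed policy: fabric 0 carries short messages, fabric 1 carries long ones."""
    if fabrics < 1:
        raise ValueError("at least one fabric is required")
    if fabrics == 1 or msg_len <= short_limit:
        return 0          # a single fabric carries everything
    return 1
```

A user-selectable policy, as the text anticipates, could be modeled by passing a different function with the same signature.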

[0128] FIG. 8 shows a diagram of the BYNET and its host side interface 802. The BYNET host side interface 802 
includes a processor 804 that executes channel programs whenever a circuit is created. Channel programs are exe- 
cuted by this processor 804 at both the send 806 and destination 808 interfaces for each node. The send-side interface 
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806 hardware executes a channel program created on the down-call that controls the creation of the circuit the trans 
mission of the data, and the eventual shutdown of the circuit. The destination-side interface 808 hardware executes a channel program to deliver the data into the memory at the destination and then complete the circuit.

[0129] The BYNET comprises a network for interconnecting the compute nodes 200 and IONs 212, which operate as processors within the network. The BYNET comprises a plurality of switch nodes 810 with input/output ports 814. The switch nodes 810 are arranged into more than g(log_b N) switch node stages 812, where b is the total number of switch node input/output ports, N is the total number of network input/output ports 816, and g(x) is a ceiling function providing the smallest integer not less than the argument x. The switch nodes 810 therefore provide a plurality of paths between any network input port 816 and network output port 816 to enhance fault tolerance and lessen contention.
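
As a worked illustration of the stage count, g(log_b N) is the smallest integer s for which b raised to the power s is at least N. The sketch below computes it with integer arithmetic to avoid floating point rounding in the logarithm; the function name `min_stages` and the sample values b = 8 and N = 64 are assumptions chosen for the example, not figures from the specification.

```python
def min_stages(b: int, n: int) -> int:
    """Return g(log_b N): the smallest integer s with b**s >= n."""
    if b < 2 or n < 1:
        raise ValueError("need b >= 2 and n >= 1")
    stages, reach = 0, 1
    while reach < n:
        reach *= b
        stages += 1
    return stages


# "More than g(log_b N)" stages means at least one extra stage.
required = min_stages(8, 64) + 1
```

With b = 8 and N = 64, g(log_b N) is 2, so a network satisfying the condition has at least 3 stages; it is this extra stage that yields more than one path between a given input port and output port.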

[0130] Processors implemented in nodes such as compute node 200 and ION 212 can be partitioned into one or more superclusters, comprising logically independent predefined subsets of processors. Communications between processors can be point to point, or multicast. In the multicast mode of communications, a single processor can broadcast a message to all of the other processors or to superclusters. Multicast commands within different superclusters can occur simultaneously. The sending processor transmits its multicast command, which propagates through the forward channel to all of the processors or the group of processors. Multicast messages are steered to a particular bounceback point 818 in the network for subsequent routing to the processors in the supercluster. This prevents deadlocking the network because it permits only one multicast message through the particular bounceback point at a time and prevents multicast messages to different superclusters from interfering with one another. The processors that receive multicast messages reply to them by transmitting, for example, their current status through the back channel. The BYNET can function to combine the replies in various ways.

[0131] BYNET supports two basic types of messages, an in-band message and an out-of-band message. A BYNET in-band message delivers the message into a kernel buffer (or buffers) at the destination host's memory, completes the circuit and posts an up-call interrupt. With a BYNET out-of-band message, the header data in a circuit message causes the interrupt handler in the BYNET driver to create the channel program that is used to process the rest of the circuit data being received. For both types of messages, the success or failure of a channel program is returned to the sender via a small message on the BYNET back channel. This back channel message is processed as part of the circuit shutdown operation by the channel program at the sender.

[0132] BYNET in-band messages do not allow the sender to target the application's buffer directly and therefore require a data copy. To resolve this issue, the present invention uses the BYNET hardware in a unique way. Instead of having the destination side interface 808 create the channel program that it needs to process the data, the send interface 806 side creates both the send-side and the destination-side channel programs. The send-side channel program transfers, as part of the message, a very small channel program that the destination side will execute. This channel program describes how the destination side is to move the data into the specified destination buffer of the target application thread. Because the sender knows the destination thread where this message is to be delivered, this technique enables the send-side to control both how and where a message is delivered, avoiding most of the trauma of traditional up-call processing on the destination side. This form of BYNET message is called a directed-band message. Unlike an active message used in the active message inter-process communication model (which contains the data and a small message handling routine used to process the message at the destination), the present invention uses BYNET directed-band messages in which the BYNET I/O processor executes the simple channel program, while with active messages the host CPU usually executes the active message handler.

[0133] The use of the back channel allows the send-side interface to suppress the traditional interrupt method for signaling message delivery completion. For both out-of-band and directed-band messages, a successful completion indication at the send-side indicates only that the message has been reliably delivered into the destination's memory.

[0134] While this guarantees the reliable movement of a message into the memory space at the destination node, it does not guarantee the processing of the message by the destination application. For example, a destination node could have a functional memory system, but have a failure in the destination application thread that could prevent the message from ever being processed. To handle reliable processing of messages in the present invention, several methods are employed independently to both detect and correct failures in message processing. In terms of the communication protocol for the present invention, timeouts are used at the send-side to detect lost messages. Retransmission occurs as required and may trigger recovery operations.
[0135] Even with directed-band messages, the present invention must allow message delivery to a specific target at
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the destination, and a mechanism that gives the sender enough data to send a message to the right target application 
thread buffer. The present invention accomplishes this feat with a ticket-based authentication scheme. A ticket is a data structure that cannot be forged, granting rights to the holder. In essence, tickets are one-time permissions or rights to use certain resources. In the present invention, IONs 212 can control the distribution of service to the compute nodes 200 through ticket distribution. In addition, the tickets specify a specific target, a necessary requirement to implement a sender-based flow control model.
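
The three ticket properties named above (unforgeable, one-time, bound to a specific target) can be illustrated with a short sketch. The text does not say how unforgeability is achieved; the keyed hash (HMAC) used here is a substitute technique chosen for the example, and the class name `TicketIssuer` and the tuple layout are likewise assumptions.

```python
import hashlib
import hmac
import secrets


class TicketIssuer:
    """Hypothetical issuer (for example, an ION) of one-time, target-specific tickets."""

    def __init__(self):
        self._key = secrets.token_bytes(32)   # known only to the issuer
        self._outstanding = set()

    def _tag(self, target: str, nonce: str) -> str:
        msg = f"{target}:{nonce}".encode()
        return hmac.new(self._key, msg, hashlib.sha256).hexdigest()

    def issue(self, target: str) -> tuple:
        nonce = secrets.token_hex(8)
        self._outstanding.add(nonce)
        return (target, nonce, self._tag(target, nonce))

    def redeem(self, ticket: tuple) -> bool:
        target, nonce, tag = ticket
        valid = hmac.compare_digest(tag, self._tag(target, nonce))
        if not valid or nonce not in self._outstanding:
            return False
        self._outstanding.discard(nonce)      # a ticket can be used only once
        return True


issuer = TicketIssuer()
ticket = issuer.issue("buffer-7")
```

Redeeming `ticket` succeeds once and fails on any later attempt, and changing the target field invalidates the tag, which corresponds to the requirement that a ticket names a specific target.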

The "Put it There" (PIT) Protocol 

Overview

[0136] The PIT protocol is a ticket-based authentication scheme where the ticket and the data payload are transmitted in an active message using the BYNET directed-band message protocol. The PIT protocol is a unique blend of ticket-based authentication, sender-based addressing, debit/credit flow control, zero memory copy, and active messages.


PIT Messages 

[0137] FIG. 9 shows the basic features of a PIT message or packet 901, which contains a PIT header 902 followed by payload data 904. The PIT header 902 comprises a PIT ID 906, which represents an abstraction of the target data buffer, and is a limited life ticket that represents access rights to a pinned buffer of a specified size. Elements that own the PIT ID 906 are those that have the right to use the buffer, and a PIT ID 906 must be relinquished when the PIT buffer is used. When a destination receives a PIT message, the PIT ID 906 in the PIT header specifies the target buffer to the BYNET hardware where the payload is to be moved via a DMA operation.

[0138] Flow control under the PIT protocol is a debit/credit model using sender-based addressing. When a PIT message is sent, it represents a flow-control debit to the sender and a flow-control credit to the destination. In other words, if a device sends a PIT ID 906 to a thread, that thread is credited with a PIT buffer in the address space. If the device returns a PIT ID 906 to its sender, the device is either giving up its rights or is freeing the buffer specified by the PIT ID 906. When a device sends a message to a destination buffer abstracted by the PIT ID 906, the device also gives up its rights to the PIT buffer. When a device receives a PIT ID 906, it is a credit for a PIT buffer in the address space of the sender (unless the PIT ID 906 is the device's PIT ID 906 being returned).
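
The debit/credit accounting can be sketched as a small model in which each endpoint holds a list of PIT IDs for buffers on its peer. Sending consumes one held PIT ID (a debit) and may carry one of the sender's own PIT IDs for the reply (a credit to the peer). The class `PitEndpoint`, the dictionary used as a message, and the numeric PIT IDs are assumptions for illustration only.

```python
class PitEndpoint:
    """Tracks the PIT IDs this endpoint currently holds for buffers on a peer."""

    def __init__(self, name: str):
        self.name = name
        self.credits = []                 # tickets for buffers in the peer's address space

    def grant(self, peer: "PitEndpoint", pit_id: int) -> None:
        peer.credits.append(pit_id)       # the peer is credited with one of our buffers

    def send(self, payload: bytes, reply_pit=None) -> dict:
        if not self.credits:
            raise RuntimeError("no credit available: sender must wait")
        debit = self.credits.pop(0)       # sending gives up the rights to that buffer
        return {"debit": debit, "credit": reply_pit, "payload": payload}


cn, ion = PitEndpoint("cn"), PitEndpoint("ion")
ion.grant(cn, 11)                         # ION hands the compute node one PIT ID
msg = cn.send(b"read block 7", reply_pit=42)
```

After this single send the compute node holds no further credits for the ION, so a second send has to wait until the ION grants another PIT ID; that waiting is the flow control.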

[0139] At the top of the header 902 is the BYNET channel program 908 (send-side and destination side) that will process
the PIT packet 901. Next are two fields for transmitting PIT ID tickets: the credit field 910 and the debit field 912.
The debit field 912 contains a PIT ID 906 where the payload data will be transferred by the destination network interface
via the channel program. It is called the debit field, because the PIT ID 906 is a debit for the sending application thread
(a credit at the destination thread). The credit field 910 is where the sending thread transfers or credits a PIT buffer to
the destination thread. The credit field 910 typically holds the PIT ID 906 where the sending thread is expecting to be
sent a return message. This usage of the credit PIT is also called a SASE (self-addressed stamped envelope) PIT. The
command field 914 describes the operation the target is to perform on the payload data 904 (for example a disk read or
write command). The argument fields 916 are data related to the command (for example the disk and block number on
the disk to perform the read or write operation). The sequence number 918 is a monotonically increasing integer that is
unique for each source and destination node pair. (Each pair of nodes has one sequence number for each direction).
The length field 920 specifies the length of PIT payload data 904 in bytes. The flag field 922 contains various flags that
modify the processing of the PIT message. One example is the duplicate message flag. This is used in the retransmission
of potentially lost messages to prevent processing of an event more than once.
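
The header layout described above can be illustrated with a minimal sketch. The field names follow the reference numerals in FIG. 9, but the Python types, the `FLAG_DUPLICATE` bit value, and the `make_packet` helper are assumptions made for illustration only; the specification does not define field widths or encodings.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

FLAG_DUPLICATE = 0x01  # hypothetical bit value for the duplicate message flag


@dataclass
class PitHeader:
    channel_program: int    # BYNET channel program 908
    credit: Optional[int]   # credit field 910 (SASE PIT ID), may be absent
    debit: int              # debit field 912 (target PIT ID at the destination)
    command: str            # command field 914, e.g. "read" or "write"
    args: Tuple[int, ...]   # argument fields 916, e.g. (disk, block)
    sequence: int           # sequence number 918, per node pair and direction
    length: int             # length field 920, payload size in bytes
    flags: int = 0          # flag field 922

    def is_duplicate(self) -> bool:
        return bool(self.flags & FLAG_DUPLICATE)


@dataclass
class PitPacket:
    header: PitHeader
    payload: bytes = b""


def make_packet(channel_program, debit, credit, command, args, sequence,
                payload=b"", flags=0):
    """Build a packet whose length field always matches the payload."""
    header = PitHeader(channel_program, credit, debit, command,
                       tuple(args), sequence, len(payload), flags)
    return PitPacket(header, payload)
```

Deriving the length field from the payload inside the helper is a design choice of this sketch, made so that the two cannot disagree.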

[0140] When the system first starts up, no node has PIT IDs 906 for any other node. The BYNET software driver prevents
the delivery of any directed-band messages until the PIT first open protocol is completed. The distribution of PIT
IDs 906 is initiated when an application thread on a compute node 200 does the first open for any virtual disk device
located on an ION 212. During the first open, the ION 212 and compute node 200 enter a stage of negotiation where
operating parameters are exchanged. Part of the first open protocol is the exchange of PIT IDs 906. PIT IDs 906 can
point to more than a single buffer, as the interface supports both gather DMA at the sender and scatter DMA at the
destination. The application is free to distribute the PIT ID 906 to any application on any other node.
[0141] The size and number of PIT buffers to be exchanged between this compute node 200 and ION 212 are tunable
values. The exchange of debit and credit PIT IDs 906 (those in debit field 912 and credit field 910) forms the foundation
of the flow control model for the system. A sender can only send as many messages to the destination as there are
credited PIT IDs 906. This bounds the number of messages that a given host can send. It also assures fairness in that
each sender can at most only exhaust those PIT IDs 906 that were assigned to it, as each node has its own PIT ID 906
pool.
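
The bound that credited PIT IDs place on a sender can be sketched as follows. This is an illustration under stated assumptions, not the driver described in the specification: the class name `CreditedSender`, its methods, and the use of plain integers as PIT IDs are all hypothetical.

```python
from collections import deque


class CreditedSender:
    """Sender that may transmit only while it holds credited PIT IDs."""

    def __init__(self, initial_credits):
        self.credits = deque(initial_credits)  # PIT IDs granted by the destination
        self.waiting = deque()                 # requests queued locally for lack of credit
        self.sent = []                         # (pit_id, request) pairs handed to the fabric

    def submit(self, request):
        self.waiting.append(request)
        self._drain()

    def on_completion(self, returned_credit=None):
        # A completion may or may not carry a credit back to the sender.
        if returned_credit is not None:
            self.credits.append(returned_credit)
        self._drain()

    def _drain(self):
        # Each send consumes one credit (a debit for the sender).
        while self.credits and self.waiting:
            self.sent.append((self.credits.popleft(), self.waiting.popleft()))
```

With two initial credits and three submitted requests, the third request stays in the local queue until a completion returns a credit.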

[0142] The ION 212 controls the pool of PIT tickets it has issued to compute nodes 200. The initial allocation of PIT
IDs 906 to a compute node 200 occurs during the first open protocol. The number of PIT IDs 906 being distributed is
based on an estimate of the number of concurrent active compute nodes 200 using the ION 212 at one time and the
memory resources in the ION 212. Since this is just an estimate, the size of the PIT pool can also be adjusted dynamically
during operation by the ION 212. This redistribution of PIT resources is necessary to assure fairness in serving
requests from multiple compute nodes 200.

[0143] PIT reallocation for active compute nodes 200 proceeds as follows. Since active compute nodes 200 are constantly
making I/O requests, PIT resources are redistributed to them by controlling the flow of PIT credits in completed
I/O messages. Until the proper level is reached, PIT credits are not sent with ION 212 completions (decreasing the PIT
pool for that compute node 200). A more difficult situation is presented for compute nodes 200 that already have a PIT
allocation, but are inactive (and tying up the resources). In such cases, the ION 212 can send a message to invalidate
the PIT (or a list of PIT IDs) to each idle compute node 200. If an idle compute node 200 does not respond, the ION 212
may invalidate all the PIT IDs for that node and then redistribute the PIT IDs to other compute nodes 200. When an idle
compute node 200 attempts to use a reallocated PIT, the compute node 200 is forced back into the first open protocol.
[0144] Increasing the PIT allocation to a compute node 200 is accomplished as described below. A PIT allocation message
can be used to send newly allocated PIT IDs to any compute node. An alternative technique would be to send
more than one PIT credit in each I/O completion message.

PIT Protocol In Action - Disk Read and Write 



[0145] To illustrate the PIT protocol, discussion of a compute node 200 request for a storage disk 224 read operation
from an ION 212 is presented. Here, it is assumed that the first open has already occurred and there are sufficient numbers
of free PIT buffers on both the compute node 200 and the ION 212. An application thread performs a read system
call, passing the address of a buffer where the disk data is to be transferred to the compute node high level system
driver (virtual storage interconnect protocol driver). The CN system driver interfaces with the application 204 and the
fabric driver on the compute node 200, handles naming, and provides for a binary compatible disk interface. The CN
system driver creates a PIT packet that contains this request (including the virtual disk name, block number and data
length). The upper half of the CN system driver then fills in the debit and credit PIT ID fields 910, 912. The debit PIT
field is the PIT ID 906 on the destination ION 212 where this read request is being sent. Since this is a read request,
the ION 212 needs a way to specify the application's buffer (the one provided as part of the read system call) when it
creates the I/O completion packet. Because PIT packets use sender-based addressing, the ION 212 can only address
the application buffer if it has a PIT ID 906. Since the application buffer is not part of the normal PIT pool, the buffer is
pinned into memory and a PIT ID 906 is created for the buffer. Since the read request also requires return status from
the disk operation, a scatter buffer for the PIT is created to contain the return status. This SASE PIT is sent in the credit
field as part of the read PIT packet. The PIT packet is then placed on the outgoing queue. When the BYNET interface
802 sends the PIT packet, it moves it from the send-side via a DMA operation, and then transfers it across the interconnect
fabric 106. At the destination-side BYNET interface 808, as the PIT packet arrives it triggers the execution of the
PIT channel program by a BYNET interface processor 804. The BYNET channel processor 804 in the host side interface
802 extracts the debit PIT ID 906 to locate the endpoint on the ION 212. The channel-program extracts the buffer
address and programs the interface DMA engine to move the payload data directly into the PIT buffer - thus allowing
the PIT protocol to provide the zero data copy semantics. The BYNET interface 802 posts an interrupt to the receiving
application on the ION 212. No interrupt occurs on the compute node 200. When the back-channel message indicates
the transfer failed, then depending on the reason for the failure, the I/O is retried. After several attempts, an ION 212
error state is entered (see the ION 212 recover and fail-over operations described herein for specific details) and the
compute node 200 may attempt to have the request handled by the other ION (e.g. ION 214) in the dipole. If the message
was reliably delivered into the destination node memory, the host side then sets up a re-transmission timeout
(which is longer than the worst case I/O service times) to ensure the ION 212 successfully processes the message.
When this timer expires, the PIT message is resent by the compute node to the ION 212. If the I/O is still in progress,
the duplicate request is simply dropped, otherwise the resent request is processed normally. Optionally, the protocol
could also require an explicit acknowledge of the resent request to reset the expiration timer and avoid the trauma of
failing the I/O to the application.
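
The handling of a resent request at the ION can be sketched as below. The sketch assumes that requests are tracked by sequence number 918 and that a resent message carries the duplicate message flag; the `RequestTracker` class and its return strings are hypothetical and serve only to show the drop-or-process decision.

```python
class RequestTracker:
    """ION-side record of requests from one compute node, keyed by sequence number."""

    def __init__(self):
        self.in_progress = set()

    def receive(self, sequence, duplicate_flag=False):
        # A resent request whose I/O is still in progress is simply dropped.
        if duplicate_flag and sequence in self.in_progress:
            return "dropped"
        # Otherwise the request (original or resent) is processed normally.
        self.in_progress.add(sequence)
        return "processing"

    def complete(self, sequence):
        self.in_progress.discard(sequence)
```

A resend that arrives after the original was lost, or after it has already completed, falls through to normal processing in this sketch.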

[0146] FIG. 10 is a block diagram of the ION 212 functional modules. Input to the IONs 212 and 214 are data lines
1002 and 1004, and control lines 1006. Each module in the ION 212 comprises a control module 1008 in communication
with control lines 1006. The control modules 1008 accept commands from data lines 1002 and provide module control
functions. System function module 1010 implements the ION functions described herein. IONs 212 and 214
comprise a fabric module 1020, a cache module 1014, a data resiliency module 1016, and a storage module 1018.
Each of these modules comprises a control module, a workload injector 1020 for inserting and retrieving data from data
lines 1002 and 1004, and a data fence 1022 for inhibiting the passage of data.

[0147] After a PIT read request is sent to the ION 212, it is transferred to the workload injector of the ION cache module
1014. The workload-injector inserts requests into an ION cache module 1014, which may return the data directly if it
was cached or allocate a buffer for the data and pass it on to the ION storage module 1018. The ION storage system
module 1018 translates this request into one (or more) physical disk request(s) and sends the request(s) to the appropriate
disk drive(s) 224. When the disk read operation(s) complete, the disk controller posts an interrupt to signal the
completion of the disk read. The ION workload-injector creates an I/O completion PIT packet. The debit PIT ID (stored
in debit field 912) is the credit PIT ID (stored in credit field 910) from the SASE PIT in the read request (this is where
the application wants the disk data placed). The credit PIT ID is either the same PIT ID the compute node 200 sent this
request to, or a replacement PIT ID if that buffer is not free. This credit PIT will give the compute node credit for sending
a future request (this current PIT request has just completed so it increases the queue depth for this compute node 200
to this ION 212 by one). There are three reasons why an ION 212 may not return a PIT credit after processing a PIT.
The first is that the ION 212 wants to reduce the number of outstanding requests queued from that compute node 200.
The second reason is the ION 212 wants to redistribute the PIT credit to another compute node 200. The third reason
is there may be multiple requests encapsulated into a single PIT packet (see the Super PIT packets discussion herein).
The command field 914 is a read complete message and the argument is the return code from the disk drive read operation.
This PIT packet is then queued to the BYNET interface 802 to be sent back to the compute node 200. The BYNET
hardware then moves this PIT packet via a DMA to the compute node 200. This triggers the compute node 200 BYNET
channel program to extract the debit PIT ID 912 and validate it before starting the DMA into the target PIT buffer (which
in this case is the application's pinned buffer). When the DMA is completed, the compute node BYNET hardware triggers
an interrupt to signal the application that the disk read has completed. On the ION 212, the BYNET driver returns
the buffer to the cache system.
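
The way the completion packet reuses the SASE PIT can be sketched as follows. Packets are modeled here as plain dictionaries, and the function name `build_read_completion` and its parameters are assumptions for illustration; the sketch only shows that the debit field of the completion is taken from the credit field of the request, and that returning a credit is optional.

```python
def build_read_completion(request, data, return_code, credit_to_return=None):
    """Build an I/O completion packet for a PIT read request.

    request          -- dict with at least a "credit" key holding the SASE PIT ID
    credit_to_return -- PIT ID credited back to the compute node, or None when
                        the ION withholds the credit
    """
    return {
        "debit": request["credit"],   # SASE PIT: where the application wants the data
        "credit": credit_to_return,   # same or replacement PIT ID, or no credit
        "command": "read_complete",
        "args": (return_code,),       # return code from the disk read operation
        "length": len(data),
        "payload": data,
    }
```

Passing `credit_to_return=None` corresponds to the three cases listed above in which the ION does not return a PIT credit.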

[0148] The operations performed for a write request are similar to those performed for the read operation. The application
calls the CN high level driver (VSIP), passing the address that contains the data, virtual disk name, disk block
number, and data length. The CN high level driver selects a PIT ID 906 on the destination ION 212 and uses this data
to create a PIT write request. The SASE PIT will contain only the return status of the write operation from the ION 212.
At the ION 212, an interrupt is posted when the PIT packet arrives. This request is processed the same way as a PIT
read operation; the write request is passed to the cache routines that will eventually write the data to disk. When the
disk write completes (or the data is safely stored in the write cache of both ION nodes 212 and 214), an I/O completion
message is sent back to the compute node 200. When the ION 212 is running with write-cache enabled, the other ION
214 in the dipole, rather than the ION 212 to which the request was sent, returns the I/O completion message. This is
further described herein with respect to the Bermuda Triangle Protocol.

Stale PIT IDs and Fault Recovery Issues 

[0149] The exchange of PIT IDs during first open is the mechanism through which stale PIT IDs 906 created by either
a hardware or software failure are invalidated. Consider the situation where an ION 212 and a compute node 200 have
exchanged PIT IDs and suddenly the ION 212 crashes. PIT IDs 906 represent target buffers pinned in memory and,
unless invalidated, outstanding PIT IDs 906 for either an ION 212 or a compute node 200 that has just rebooted could
cause a significant software integrity problem, due to PIT IDs that are no longer valid, or stale. The BYNET hardware
and the directed-band message support provide the essential mechanism for invalidating stale PIT IDs 906.

[0150] At the end of the first open protocol, each side must give the CN high level driver a list of hosts to which PIT
IDs 906 are distributed. Stated differently, the host is giving the CN high level driver a list of hosts from which it will
accept PIT packets. The compute node high level driver then uses this list to create a table that controls the delivery of
directed-band messages. This table specifies the combinations of ION 212 pairs that allow directed-band messages to
be sent to each other. (The table can also specify one-way PIT message flows). The compute node high level driver
keeps this table internally on the hosts (as data private to the driver) as part of the BYNET configuration process. Hosts
can be added or subtracted from this list by the PIT protocol at any time by a simple notification message to the compute
node high level driver. When a node fails, shuts down, or fails to respond, the BYNET hardware detects this and will
notify all the other nodes on the fabric. The BYNET host driver on each node responds to this notification and deletes
all references to that host from the directed-band host table. This action invalidates all PIT IDs 906 that host may have
distributed to any other host. This is the key to protecting a node from PIT packets previously distributed. Until the CN
high level driver BYNET driver on that host has been reconfigured, the BYNET will fail all messages that are sent to that
host. Even after first reconfiguration, until it is told by the local PIT protocol, the BYNET will not allow any directed-band
message to be sent to this newly restarted or reconfigured host. This protects against the delivery of any stale PIT packets
until the PIT protocol has been properly initialized through the first open protocol.
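
The effect of the directed-band host table on stale PIT IDs can be sketched as follows. The class `DirectedBandTable`, its method names, and the choice of `PermissionError` to model the refusal are assumptions for illustration; the specification does not describe the table's data structure.

```python
class DirectedBandTable:
    """Per-node table of hosts to which directed-band messages may be sent."""

    def __init__(self):
        self.allowed = set()

    def allow(self, host):
        # Called by the PIT protocol once first open with `host` has completed.
        self.allowed.add(host)

    def node_down(self, host):
        # Fabric notification: drop every reference to the failed host,
        # which implicitly invalidates all PIT IDs exchanged with it.
        self.allowed.discard(host)

    def send(self, host, packet):
        if host not in self.allowed:
            # Refused on the send side; the caller is expected to redo first open.
            raise PermissionError(f"no directed-band path to {host}")
        return True
```

In this sketch a restarted host stays unreachable until `allow` is called again, mirroring the requirement that first open completes before any directed-band message is delivered.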

[0151] When a host attempts to send a directed-band message to an invalid host (using a now invalidated PIT ID
906), the send-side compute node high level driver refuses the message with an error condition to the sender. This
rejection will trigger the first open handshaking to be invoked between the two nodes. After the first open handshaking
completes, any I/O operations for the ION 212 that are still pending (from the perspective of the compute node) will have
to be resent. However, unless this was a warm re-start, it is likely that the ION 212 was down for a long time, so any
pending I/O operations would have been restarted as part of fail-over processing and sent to the other ION 212 in the
dipole. (See the sections on ION fault handling for more details). If the crashed node had been a compute node 200,
the unexpected arrival of a first open request at the ION 212 for a compute node 200 that had already gone through a
first open will trigger PIT ID recovery operations. The ION 212 will invalidate all PIT IDs 906 credited to the compute
node 200 (or in reality will probably just reissue the old ones). Any pending I/O operations for that compute node 200 are
allowed to complete (though this is an unlikely event unless the time for a node restart [...]).

Super PIT (SPIT) - Improving Small I/O Performance 

[0152] The PIT protocol has an advantage over normal SCSI commands. Because the core of the present invention
is a communication network, not a storage network, the system can use network protocols to improve performance over
what a storage model would allow. Processing overhead of handling up-calls represents a performance wall for workloads
dominated by small I/O requests. There are several approaches to improving small I/O performance. One approach
is to improve the path length of the interrupt handling code. The second is to collapse the vectoring of multiple
interrupts into a single invocation of the interrupt handler using techniques similar to those employed in device drivers.
The third is to reduce the number of individual I/O operations and cluster (or convoy) them into a single request. Nodes
which have to repackage incoming and outgoing data flows due to different [...]
physical links tend to collect data. This problem is also worsened by speed mismatches between the sending and destination
networks (especially where the destination network is slower). These [...]

[0153] The present invention takes advantage of data convoys as a technique for reducing the number of up-call
interrupts in both the ION 212 and the compute node 200. By way of illustration, consider the data flow from an ION 212
to a compute node 200. In the debit/credit model for flow control used by the present invention, I/O requests queue at
both the compute node 200 and the ION 212. Queuing starts with PIT packets stored in the ION 212 and, when that is
exhausted, queuing continues back at the compute node 200. This is an overflow condition. Usually, overflow
occurs when a node has more requests than it has PIT buffer credits. Each time an I/O completes, the ION 212
sends a completion message back to the compute node 200. Usually, this completion message includes a credit for the
PIT buffer resource just released. This is the basis of the debit/credit flow control. When the system is swamped with
I/O requests, each I/O completion is immediately replaced with a new I/O request at the ION 212. Therefore, under periods
of heavy load, I/O requests flow one at a time to the ION 212, and queue in the ION 212 for an unspecified period.
Each of these requests creates an up-call interrupt, increasing the load on the ION 212.

[0154] This dual queue model has a number of advantages. The number of PIT buffers allocated to a compute node
200 is a careful tradeoff. There should be sufficient workload queued locally to the ION 212 so that when requests
complete, new work can be rapidly dispatched. However, memory resources consumed by queued requests on the ION 212
may be better utilized if assigned to a cache system. When PIT queues on the ION 212 are kept short to conserve
memory, performance may suffer if the ION 212 goes idle and has to wait for [...]

[0155] Super-PIT is an aspect of the PIT protocol designed to take advantage of the flow control of a debit/credit system
at high loads in order to reduce the number of up-call interrupts. Super-PIT improves the performance of OLTP and
similar workloads dominated by high rates of relatively small I/Os. Instead of sending requests one at a time, a super-PIT
packet is a collection of I/O requests all delivered in a single, larger super-PIT request. Each super-PIT packet is
transported the same way as a regular PIT buffer. Individual I/O requests contained within the super-PIT packet are
then extracted and inserted into the normal ION 212 queuing mechanism by the PIT workload-injector when ION 212
resources become available. These individual I/O requests can be either read or write requests.
[0156] The PIT workload-injector acts as local proxy (on the ION 212) for application requests transported to the ION
212. The PIT workload-injector is also used by the RT-PIT and FRAG-PIT protocols discussed in a later section. When
the super-PIT is exhausted of individual requests, the resource is freed to the compute node and another super-PIT
packet can be sent to replace it. The number of super-PIT packets allowed per host will be determined at first open [...]
bounded only by the size of the buffer to which the super-PIT is transported. Super-PIT packets operate differently from
normal PIT packets. In the present invention's control model, devices can only send a request (a debit) if they hold a
credit for the destination. The particular PIT packet used by the device is of no particular concern, as the device is not
targeting a specific application thread within the ION 212. PIT packets to the ION 212 just regulate buffer utilization (and
flow control as a side effect). In contrast, the SASE PIT within a PIT request is different. The SASE PIT ID represents
an address space of an individual thread within the compute node 200. Each request in the super-PIT contains a SASE
PIT, but when the I/O they represent completes, the I/O completion message created does not include a credit PIT. Only
when the super-PIT has been drained of all requests is a credit PIT issued for its address space.

[0158] The creation of a super-PIT on a compute node 200 is described as follows. A super-PIT can be created
whenever there are at least two I/O requests to a single ION 212 queued within the compute node 200. If the limit for
super-PIT packets for that compute node 200 has already been reached on this ION 212, the compute node 200 will
continue to queue up requests until a super-PIT ID is returned to it. The compute node 200 then issues another super-PIT
message. Within the system driver, once queuing begins, per-ION queues will be required to create the super-PIT
packets.
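
The creation rule above can be sketched for a single per-ION queue. The function `build_super_pit`, the representation of requests as byte strings, and the `buffer_size` parameter are assumptions for illustration; the sketch combines the "at least two queued requests" condition with the bound imposed by the size of the buffer to which the super-PIT is transported.

```python
from collections import deque


def build_super_pit(queue, buffer_size):
    """Pop queued requests for one ION into a single super-PIT batch.

    Returns an empty list when fewer than two requests are queued, in which
    case nothing is removed from the queue.
    """
    if len(queue) < 2:
        return []
    batch, used = [], 0
    # Pack requests in arrival order until the next one would not fit.
    while queue and used + len(queue[0]) <= buffer_size:
        request = queue.popleft()
        batch.append(request)
        used += len(request)
    return batch
```

Requests that do not fit remain queued on the compute node and become candidates for the next super-PIT once a super-PIT ID is returned.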

[0159] As discussed above, super-PIT messages can reduce the processing load on an ION 212 under workloads
that are dominated by a large volume of small I/O requests. Super-PIT messages improve the performance of the destination
node and improve the utilization of the interconnect fabric 106 due to an increase in average message size.
However, the concept of super-PIT messages can be applied at the ION 212 to reduce the load on the compute node
200 created by small I/O workloads as well. Creating super-PIT messages on the ION 212 is a far different problem than
creating them on the compute node 200. On the compute node 200, application threads creating I/O requests are subject
to flow control to prevent the ION 212 from being overwhelmed. The service rate of the disk subsystem is far lower
than the rest of the ION 212 and will always be the ultimate limitation for ION 212 performance. Requests are blocked
from entering the system until the ION 212 has sufficient resources to queue and eventually service the request. The
point is that requests would queue on the compute node (or the application would be blocked) until resources are available
on the ION 212. Resource starvation is not an issue on the compute node 200. When a compute node 200 application
submits a request for I/O to the system, included as part of the request are the compute node 200 memory
resources required to complete the I/O (the application thread buffer). For every I/O completion message the ION 212
needs to send to the compute node 200, it already has an allocated PIT ID (the SASE PIT ID). From the viewpoint of
the ION 212, I/O completion messages already have the target buffer allocated and can be filled as soon as the data is
ready. The I/O completion message is successful once it has been delivered (the ION 212 does not have to wait for the
service time of a disk storage system at the compute node). Hence, the ION 212 cannot block due to flow control pressure
from a compute node. To create super-PIT messages, the compute node took advantage of flow control queuing,
an option the ION 212 does not have. Since the ION 212 does not have any resources to wait for, other than access to
the BYNET, the opportunity to create super-PIT messages is far less.

[0160] Several approaches for creating super-PIT messages on the ION 212 may be employed. One approach is to
delay I/O completion requests slightly to increase the opportunity of creating a super-PIT packet. If, after a small delay,
no new completion messages for the same node are ready, the message is sent as a normal PIT message. The problem
with this technique is that for any amount of time the request is delayed looking to create a super-PIT (to reduce up-call
overhead on the compute node), there is a corresponding increase in total request service time. The net effect is a
reduced load on the compute node 200, but the delay may also slow the application. An adaptive delay time would be
beneficial (depending on the average service rate to a compute node 200 and the total service time accumulated by a
specific request). The second approach is a slight variation of the first. This would require each compute node 200 to supply
each ION 212 with a delay time that would increase as the small I/O rate at the compute node increases. The point is
to increase the window for creating super-PIT messages for a specific ION 212 when it is needed. The third approach
would be to delay certain types of traffic such as small reads or writes that were serviced directly by the cache and did
not involve waiting for a storage 224 disk operation. While the cache reduces the average I/O latency through avoiding
disk traffic for some percentage of the requests, the distribution of latencies is altered by cache hits. A small queue
delay time for a cache hit request would not be a major increase in service time compared to that which included a disk
operation. For those applications that are sensitive to service time distribution (where uniform response time is important
to performance), a small delay to create a super-PIT packet on the ION 212 has the potential to improve overall
system performance.

Large Block Support and Fragmented PIT Packets 

[0161] Performance requirements for database applications are often independent of the size of the database. As the
size of the database increases, the rate at which disk storage is examined must also increase proportionately to prevent
erosion in application performance. Stated differently, for customer databases to grow in size, response time has to
remain constant for a given query. The difficulty in meeting these requirements is that they are in direct conflict with the
current trend in disk drive technology: disk drives are increasing in capacity, while their random I/O performance is
remaining constant. One approach to mitigate this trend is to increase the average size of disk I/O operations as the
capacity of the disk drive increases. Based on the current trends in storage capacity and the performance requirements,
the average I/O size of 24 KB may increase to 128 KB in the very near future. More aggressive caching and delayed
write techniques may also prove to be helpful for many workloads. Uneven technology growth in disk drives is not the
only driver behind increasing I/O request sizes. As databases with BLOBS (binary large objects) start to become popular,
objects with sizes reaching 1 MB and higher are becoming more common. Regardless of the specific cause, it is
expected that systems will need to support large I/O objects whose size will continue to track the economics of disk
storage.

[0162] There are several issues related to the transmission of large data objects between the ION 212 and compute
nodes 200 using the PIT protocol. As described herein, the advantage of the PIT protocol is the pre-allocation of destination
buffers to address the problems of flow control and end-point location. However, up-call semantics also require
the identification (or allocation) of sufficient buffer space in which to deposit the message. The PIT protocol addresses
this problem by having the send-side select the target PIT ID 906 where each message is to be deposited at the
receiver. Large I/O writes clearly complicate the protocol, as message size could become a criterion for selecting a specific
PIT ID 906 out of an available pool. Under periods of heavy load, there is the potential for situations where the
sender has available PIT ID 906 credits, but none of them meet the buffer size requirement for a large I/O request.
Under the PIT protocol, if there is a wide population of data sizes to be sent, the send-side has to work with the receive-side
to manage both the number and size of the PIT buffers. This creates a PIT buffer allocation size problem, that is,
when creating a pool of PIT buffers, what is the proper distribution of buffer sizes for a pool of PIT buffers under a given
workload? BYNET software imposes an additional maximum transfer unit (MTU) limit that complicates large I/O reads in
addition to writes. I/O requests (both read and write) that exceed the BYNET MTU must be fragmented by the software
protocol (the PIT protocol in this case) on the send-side and reassembled on the destination side. This creates the problem
of memory fragmentation. Briefly, internal fragmentation is wasted space inside an allocated buffer. External
fragmentation is wasted space outside the allocated buffers that is too small to satisfy any request. One solution
would be to use only part of a larger PIT buffer, but this would cause unnecessary internal fragmentation if larger PIT
buffers are used. Large PIT buffers waste memory, which hurts cost/performance.

[0163] In the present invention, the BYNET MTU and the PIT buffer size allocation problem is solved with the addition
of two more types of PIT messages: the RT-PIT (round trip PIT) and the FRAG-PIT (fragmented PIT). Both the FRAG-PIT
and the RT-PIT use a data pull model instead of the PIT data push model. (To push data, the send-side pushes the
data to the destination. To pull data, the destination pulls the data from the source). FRAG-PIT messages are designed
to support large data reads, while RT-PIT messages support large data writes. Both FRAG-PIT and RT-PIT are similar
to super-PIT as they also use the ION PIT workload-injector to manage the flow of data.

RT-PIT Messages 

[0164] When a compute node 200 wants to perform a large disk write operation to an ION 212 and the I/O write is
greater in size than either the BYNET MTU or any available ION 212 PIT buffer, the compute node 200 will create a
RT-PIT create message. A RT-PIT message operates in two phases: the boost phase followed by the round trip phase. In
the boost phase, a list of source buffers for the data to be written is assigned a series of PIT IDs on the compute node
200. The fragmentation size of the source buffer is determined by the BYNET MTU and the size constraints that were
specified during the ION first open protocol. This list of PIT IDs (with the corresponding buffer sizes) is placed in the
payload of a single RT-PIT request message and will be PIT credits to destination ION 212. An additional PIT buffer is
allocated from the compute node pool to be used directly by the RT-PIT protocol. The PIT ID of this additional buffer is
placed in the credit field of the PIT header. The rest of the RT-PIT request is the same as a normal PIT write message.
The compute node 200 then sends (boosts) this RT-PIT request message to the ION 212.
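
The fragmentation step of the boost phase can be sketched as follows. The function `plan_fragments`, the sequential numbering of PIT IDs, and the `(pit_id, offset, size)` tuple layout are assumptions for illustration; the sketch only shows a source buffer being cut into pieces no larger than the BYNET MTU or the size limit negotiated at first open.

```python
def plan_fragments(total_length, mtu, max_buffer, first_pit_id=0):
    """Split a large write into (pit_id, offset, size) fragments.

    The fragment size is the smaller of the MTU and the negotiated buffer
    limit; PIT IDs are numbered sequentially here purely for illustration.
    """
    fragment_size = min(mtu, max_buffer)
    if fragment_size <= 0:
        raise ValueError("fragment size must be positive")
    plan = []
    offset = 0
    pit_id = first_pit_id
    while offset < total_length:
        size = min(fragment_size, total_length - offset)
        plan.append((pit_id, offset, size))
        offset += size
        pit_id += 1
    return plan
```

The resulting list corresponds to the PIT IDs and buffer sizes carried in the payload of the RT-PIT request message; recording each offset is one way to preserve re-assembly order when fragments arrive out of order.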

[0165] At the ION 212, the PIT workload-injector processes the RT-PIT request message in two steps. For each
source side PIT ID 906, the workload-injector must request a PIT buffer from the ION cache that will match it in size.
(Note this can be done all at once or one at a time depending on the memory space available in the ION buffer cache).
By matching the PIT buffers, the ION 212 will dynamically allocate resources to match the write request. I/O can now
proceed using a modified sequence of normal PIT transfers. Processing of the RT-PIT message now enters the round-trip
phase where the workload-injector creates a RT-PIT start message for one (or more) matching pair(s) of source and
destination PIT IDs. (The option of sending one or a subset of matched PIT IDs remains at the discretion of the ION
212). The number of PIT IDs 906 in a single RT-PIT start message controls the granularity of data transfer inside the
ION 212 (as discussed below).

[0166] This RT-PIT start message is sent back to the compute node 200, ending the boost phase of the RT-PIT message.
On receipt of the RT-PIT start message, the compute node 200 starts to transfer the data to the ION 212 one PIT
pair at a time using a normal PIT write message. The fragments do not have to be sent in order by the compute node
200, as both the compute node 200 and ION 212 have sufficient data to handle lost fragments (the matched PIT pair
specifies re-assembly order). When the ION 212 receives the PIT write message, the workload-injector is notified and
recognizes that this write request is part of a larger RT-PIT I/O operation. The workload-injector has two options
for processing the PIT write: either pass the fragment to the cache routines to start the write operation or wait for the
transmission of the last fragment before starting the write. Starting the I/O early may allow the cache routines to pipeline
the data flow to the disk drives (depending on the write cache policy), but risks a performance loss from the smaller I/O
size. However, holding the I/O until all the fragments have arrived may place an undue burden on the cache system.
Since the total size and number of fragments are known from the start, the cache system has all the data needed to
decide how to optimize the large I/O request under the current operating conditions. On the compute node 200 side,
the successful transmission of each PIT write operation causes the next fragment write to commence when multiple
fragments are contained in a single RT-PIT start message. When the last fragment in a single RT-PIT start command
has been received, the request-injector passes the data to the cache system for processing similar to that of a normal
write request. When the data is safe, an I/O completion message is created by the cache system and is sent back to
the compute node 200 to signal the completion of this phase of processing (for the RT-PIT start operation). When there
are more fragments remaining, another RT-PIT start command is created and sent to the compute node, thus repeating
the cycle described above until all the fragments have been processed. When the workload-injector and the cache have
completed the processing of the last fragment, a final I/O completion message with status is returned to the compute
node to synchronize the end of all the processing for the RT-PIT request.
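
The message sequence of the round trip phase can be traced with a small simulation. This is only a sketch under stated assumptions: the function name round_trip_phase, the tuple layout of the events, and the ION-side PIT IDs starting at 1000 are hypothetical, and the per-batch completion and the final completion are modeled as separate messages, which the description leaves open.

```python
def round_trip_phase(source_pit_ids, batch_size):
    """Return the ordered list of messages exchanged for one RT-PIT request.

    batch_size models the ION's choice of how many matched PIT pairs to
    place in a single RT-PIT start message.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    events = []
    next_ion_id = 1000  # hypothetical ION-side PIT IDs
    for start in range(0, len(source_pit_ids), batch_size):
        batch = source_pit_ids[start:start + batch_size]
        # ION matches each source PIT ID with a destination buffer.
        pairs = [(src, next_ion_id + i) for i, src in enumerate(batch)]
        next_ion_id += len(batch)
        events.append(("RT-PIT start", pairs))        # ION -> compute node
        for pair in pairs:
            events.append(("PIT write", pair))        # compute node -> ION
        events.append(("I/O completion", len(pairs))) # ION -> compute node
    # Sent once every fragment has been processed by the cache.
    events.append(("final I/O completion", len(source_pit_ids)))
    return events


for event in round_trip_phase([1, 2, 3], batch_size=2):
    print(event)
```

A larger batch_size means fewer RT-PIT start messages but a coarser granularity of data transfer inside the ION.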

[0167] RT-PIT messages could be optimized with some changes to the BYNET. Consider the situation where the ION
212 has just received a RT-PIT request; the workload-injector on the ION 212 is matching up buffers on the compute
node with the ION 212 to translate the large I/O request into a number of smaller normal write requests. The
synchronization is performed through the intermediate RT-PIT start commands. However, if the BYNET allowed a received
channel program to perform a data pull, the intermediate step of sending a RT-PIT start command to the compute node
could be eliminated. For the sake of discussion, we will call this mode of BYNET operation a loop-band message. A
loop-band message is really two directed-band messages, one nested inside of the other. By way of example, when the
workload-injector receives a RT-PIT request, it will process each fragment by creating a RT-PIT start message that
contains the data needed to create a second PIT write message on the compute node. The RT-PIT start message transfers
the template for the PIT write operation for a fragment to the compute node 200. The channel program executed on the
compute node 200 (sent with the RT-PIT start message) deposits the payload on the send queue on the compute node
BYNET driver. The payload looks like a request queued from the application thread that made the initial RT-PIT request.
The payload will create a PIT write request using the pair of PIT IDs, source and destination, for this fragment sent by
the workload-injector. The PIT write will deposit the fragment on the ION 212 and will notify the workload-injector it has
arrived. The workload-injector will continue this cycle for each fragment until all have been processed. The performance
improvement of loop-band messages is derived from the removal of the interrupt and compute node processing
required for each RT-PIT start message.

[0168] FRAG-PIT messages are designed to support the operation of large I/O read requests from a compute node.
When an application makes a large I/O read request, the compute node pins the target buffer and creates a list of PIT
IDs that represent the target buffers of each fragment. Each PIT ID describes a scatter list comprised of the target
buffer(s) for that fragment and an associated status buffer. The status buffer is updated when the data is sent, allowing
the compute node to determine when each fragment has been processed. The size of each fragment is determined
using the same algorithm as RT-PIT messages (see the section on RT-PIT above). These fields are assembled to create
a FRAG-PIT.

[0169] The compute node 200 sends the FRAG-PIT request to the ION 212 where it is processed by the
workload-injector. Included in this request are the virtual disk name, starting block number, and data length of the data source on
the ION 212. The workload-injector operates on a FRAG-PIT request in a manner similar to a RT-PIT request. Each
fragment within the FRAG-PIT request is processed as a separate PIT read request in cooperation with the cache
system. The cache system can choose to handle each fragment independently or as a single read request, supplying the
disk data back to the workload-injector when it is available. When a data fragment is supplied by the cache (either
individually or as part of a single I/O operation), the data for the large read request will begin to flow back to the compute node.
For each fragment where the cache has made data available, the workload-injector sends that data fragment in a
FRAG-PIT partial-completion message back to the compute node. Each FRAG-PIT partial-completion message transmits
data similar to a regular PIT read request completion except that the FRAG-PIT partial-completion message will not
generate an interrupt at the compute node when it is delivered. The last completed fragment is returned to the compute
node with a FRAG-PIT full-completion message. A FRAG-PIT full-completion differs from a partial-completion message
in that it signals the completion of the entire FRAG-PIT read request via an interrupt (a full up-call).
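
The difference between partial-completion and full-completion messages can be shown in a few lines. This is a hedged sketch: the function name frag_pit_completions and the dictionary fields are hypothetical, and the only behavior taken from the description is that fragments may complete in any order and that only the last completed fragment raises an interrupt.

```python
def frag_pit_completions(completion_order):
    """Map the order in which the cache supplies fragments to messages.

    completion_order lists fragment indices in the order the cache makes
    their data available. Only the last one produces a full-completion.
    """
    messages = []
    last = len(completion_order) - 1
    for position, fragment in enumerate(completion_order):
        is_last = position == last
        messages.append({
            "fragment": fragment,
            "kind": "full-completion" if is_last else "partial-completion",
            "interrupt": is_last,  # partial completions are silent
        })
    return messages


msgs = frag_pit_completions([2, 0, 1])
print(sum(m["interrupt"] for m in msgs))  # 1
```

However many fragments the read is split into, the compute node takes a single interrupt, and the status buffers tell it which fragments have already arrived.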

Implementation of a PIT Protocol on Other Network Devices 

[0170] Much of the performance of the foregoing approach to network attached storage rests on the ability of the
interconnect fabric 106 to support the PIT protocol. In the case of the BYNET, a low-level interface was created that is a
close match for the PIT protocol. Other network interfaces, such as Fibre Channel, are capable of supporting the PIT
protocol as well.

[0182] Read and write requests for virtual disk blocks arrive at the ION 212 via the interconnect fabric 106. Requests
may be routed to a specific ION 212 through source-initiated selection at the compute nodes 200. Every compute node
200 knows which ION 212 will be accepting requests for each fabric virtual disk in the system. A fabric virtual disk
reflects a virtual disk model in which a unique storage extent is represented, but that storage extent neither implies nor
encodes physical locations of the physical disk(s) within the name.

[0183] Each compute node 200 maintains a list that maps fabric virtual disk names to ION dipoles 226. The list is
created dynamically through coordination between the compute nodes 200 and IONs 212. During power up and fault
recovery operations, the IONs 212 within a dipole 226 partition the virtual (and physical) disks between them and create
a list of which virtual disks are owned by which ION 212. The other ION 214 (which does not own the virtual disk or
storage resource) in the dipole 226 provides an alternative path to the virtual disk in case of failure.
[0184] This list is exported or advertised periodically across the interconnect fabric 106 to all of the other dipoles 226 
and compute nodes 200. Compute nodes 200 use this data to create a master table of primary and secondary paths to 
each virtual disk in the system. An interconnect fabric driver within the compute node 200 then coordinates with the 
dipole 226 to route I/O requests. Dipoles 226 use this "self discovery" technique to detect and correct virtual disk nam- 
ing inconsistencies that may occur when dipoles 226 are added and removed from an active system. 
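
The master table of primary and secondary paths can be pictured with a short sketch. It assumes a simple advertisement format (a dipole's two IONs plus an ownership map); build_path_table, route, and all disk and ION names are hypothetical and are not taken from the specification.

```python
def build_path_table(advertisements):
    """Build {virtual disk: {"primary": ION, "secondary": ION}}.

    Each advertisement is assumed to look like
    {"ions": ("ion_a", "ion_b"), "owners": {"vdisk": "ion_a", ...}}.
    """
    table = {}
    for adv in advertisements:
        first, second = adv["ions"]
        for vdisk, owner in adv["owners"].items():
            # The non-owning ION of the dipole is the alternative path.
            alternate = second if owner == first else first
            table[vdisk] = {"primary": owner, "secondary": alternate}
    return table


def route(table, vdisk, failed=frozenset()):
    """Pick the ION for a request, falling back to the secondary path."""
    entry = table[vdisk]
    for role in ("primary", "secondary"):
        if entry[role] not in failed:
            return entry[role]
    raise LookupError(f"no path available to {vdisk}")


table = build_path_table([
    {"ions": ("ion212", "ion214"),
     "owners": {"vdisk0": "ion212", "vdisk1": "ion214"}},
])
print(route(table, "vdisk0"))                     # ion212
print(route(table, "vdisk0", failed={"ion212"}))  # ion214
```

Keeping both paths in the table lets the compute node reroute a request without waiting for a new advertisement from the dipole.
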
[0185] Applications running on the compute nodes 200 see a block interface model like a local disk for each fabric
virtual disk that is exported to the compute node 200. As described earlier herein, the compute nodes 200 create an
entry point to each fabric virtual disk at boot time, and update those entry points dynamically using a naming protocol
established between the compute nodes 200 and the IONs 212.

Server Management 

Overview 

[0186] An important aspect of the present invention is its management, which is a subset of overall management
referred to as system management or systems administration. This subset is called server management for storage
(SMS). Management of storage-related hardware and software components as well as the placement of data entities
within the available storage space are implemented through this facility. Management actions can be initiated by an
administrator or dynamically invoked upon the occurrence of some event in the system. Management commands can
be entered and acknowledged almost instantaneously, but the results of a single, simple command might easily affect
a large number of system components for a significant period of time. For example, moving a file system from one ION
212 to another ION may take many minutes or even hours to complete, and may affect multiple IONs 212 and the compute
node(s) 200 that wish to use the subject file system. Server management is also responsible for providing the
administrator with informative and warning messages about the state of system hardware and software.
[0187] The administrator perceives the system primarily through a series of screen display "views". Several views of
the overall system may be presented. The primary view is a hierarchical view; at the top level, all compute nodes 200,
IONs 212, and fabrics 106 within the system are shown. Drill-down techniques permit more detailed displays of items
of interest. Most systems are large enough that the size and complexity cannot be rendered onto a single display page.
Graphical views are rendered showing either a physical (geographic) or a logical view. Individual entities or groups of
entities can be selected for more detailed viewing and administration, and results of requests can be displayed in
user-selected formats.

[0188] A tabular method of presentation is also provided, and individuals or groups can be viewed and administered
in this view. An important aspect of this management is the presentation of the path of a particular piece of data from a
particular compute node 200 through to the physical storage disk(s) 224 that contain it. This path is presented in
tabular form displaying its resilience, that is, how many separate component failures it will take before the data
becomes unavailable.
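
One way to read the resilience figure is as the smallest number of components whose failure breaks every path to the data. The sketch below computes that number by brute force. It is an illustration under assumptions: the specification does not say how the figure is calculated, each path is modeled as a set of component names, and the names used here are hypothetical.

```python
from itertools import combinations


def resilience(paths):
    """Smallest number of component failures that breaks every path.

    paths is a list of sets; each set names the components one path
    depends on. Brute force, so only suitable for small examples.
    """
    components = sorted(set().union(*paths))
    for k in range(1, len(components) + 1):
        for failed in combinations(components, k):
            # Data is unavailable once every path contains a failed part.
            if all(path & set(failed) for path in paths):
                return k
    return 0  # no paths were given


# Two routes that share one physical disk: a single disk failure is fatal.
shared = [{"fabric0", "ion212", "disk_a"}, {"fabric1", "ion214", "disk_a"}]
# Two fully independent routes, for example to mirrored disks.
mirrored = [{"fabric0", "ion212", "disk_a"}, {"fabric1", "ion214", "disk_b"}]
print(resilience(shared), resilience(mirrored))  # 1 2
```

A tabular view could show this number next to each path so that an administrator can spot data that depends on a single shared component.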

Volume Set Creation 

[0189] Creating a volume set (VS) allocates free space to be used by a host compute node 200 application 204. Vol- 
ume sets are based within an ION 212 and have names (the VSIs 602 described herein), sizes, and RAID (redundant 
array of inexpensive disks) data protection levels. The system administrator creates the VS based on requirements and 
may specify location and redundancy characteristics. Multiple VSs may be created with group operations. 
[0190] In summary, a highly-scalable parallel processing computer system architecture is described. The parallel
processing system comprises a plurality of compute nodes for executing applications, a plurality of I/O nodes, each
communicatively coupled to a plurality of storage resources, and an interconnect fabric providing communication
between any of the compute nodes and any of the I/O nodes. The interconnect fabric comprises a network for
connecting the compute nodes and the I/O nodes, the network comprising a plurality of switch nodes arranged into more than
g(log_b N) switch node stages, wherein b is a total number of switch node input/output ports, N is a total number of
network input/output ports, and g(x) indicates a ceiling function providing the smallest integer not less than the argument x,
the switch node stages thereby providing a plurality of paths between any network input port and network output port.

Claims

1. A parallel processing system, comprising:

a plurality of compute nodes for executing applications;

a plurality of input/output (I/O) nodes, each communicatively coupled to a plurality of storage resources; and

an interconnect fabric providing communication between any of the compute nodes and any of the I/O nodes;

wherein the interconnect fabric comprises a network for connecting the compute nodes and I/O nodes via [...]
ports and a plurality of network ports [...]
switch nodes arranged into more than g(log_b N) switch node stages, wherein b is a total number of switch node
input/output ports, N is a total number of network input/output ports and g(x) indicates a ceiling function providing
the smallest integer not less than the argument x, [...]
paths between any network input port and network output port, the switch node stages being [...]
bounceback points at a highest switch node stage of the network [...]

4. A parallel processing system, comprising:

a plurality of compute nodes for executing applications;

a plurality of input/output (I/O) nodes, each managing a communicatively coupled plurality of storage
resources [...] projecting an image of storage objects stored on [...]

[...] interconnect fabric providing communication between any of the [...]

5. The parallel processing system of claim 4, wherein the I/O nodes comprise: 

means for generating a globally unique identification for a data object stored on the storage resource, and for
binding the globally unique identification to the data object;

means for exporting the globally unique identification to all of the compute nodes via the communication fabric;

6. The parallel processing system of claim 5, wherein the compute nodes comprise means for creating a local entry
point in the compute node for the data object based on the globally unique identification.

7. The parallel processing system of claim 6, wherein:

each compute node comprises an associated operating system; and

the globally unique identification comprises operating system dependent data enabling use of the data
identified by the globally unique identification by the operating system of the compute node.

8. The apparatus of claim 4 or claim 5, wherein the I/O nodes are organized into a plurality of cliques, each comprising
a primary I/O node and a secondary I/O node, wherein

the primary I/O node generates and exports a globally unique identification; and 

the secondary I/O node exports a globally unique identification when the primary I/O node is inoperative. 

9. The apparatus of claim 8, wherein the secondary I/O node retrieves the data object from the storage resource in 
response to a request from the compute node transmitted over the interconnect fabric when the primary I/O node 
is inoperative. 

10. The apparatus of claim 4, wherein: 

the interconnect fabric communicates data transfer messages among the compute nodes and the I/O nodes, 
the data transfer messages comprising a destination channel program, and 

the apparatus further comprises a communication fabric interface, coupled between the compute nodes and
the interconnect fabric and the I/O nodes and the interconnect fabric, the interface comprising a channel
processor for executing the destination channel program to deliver the data transfer message.

11. The apparatus of claim 10, wherein the data transfer messages further comprise a sending destination channel
program for creating a communication circuit in the interconnect fabric, transmitting the data transfer message,
and then shutting down the communication circuit in the interconnect fabric.